Capstone Completion Modeling - Swire Coca Cola Forecasting¶

Table Of Contents¶

Introduction: Business Problem Statement

Importing Libraries

1. Innovative Product

1.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand

Data Preparation

Exponential Smoothing Model

Prophet Time Series Model

SARIMA Time Series Model

Results

1.2 Demand forecasting on Manufacturer, Caloric Segment and Flavor

Data Preparation

Exponential Smoothing Model

Prophet Time Series Model

SARIMA Time Series Model

Results

2. Innovative Product

2.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand

Data Preparation

Prophet Time Series Modeling

Exponential Smoothing Modeling

ARIMA Modeling

Results

2.2 Demand forecasting on Flavor, Manufacturer, Category, Caloric Segment

Data Preparation

Prophet Time Series Modeling

Exponential Smoothing Modeling

ARIMA Modeling

Results

2.3 Demand forecasting on Flavor, Non-Swire Manufacturer, Category, and Caloric Segment

Data Preparation

Exponential Smoothing Modeling

Prophet Time Series Modeling

ARIMA Modeling

Results

3. Innovative Product

3.1 Demand forecasting on Brand, Manufacturer, Category, Caloric Segment in Southern Regions

Data Preparation

Prophet TimeSeries Modeling

Exponential Smoothing Model

Results

3.2 Demand forecasting on Package, Caloric Segment, Category and Manufacturer in Southern Regions

Data Preparation

Prophet TimeSeries Modeling

Results

3.3 Demand forecasting based on Package, Caloric Segment, Category and Non-Manufacturer

Data Preparation

Prophet TimeSeries Modeling

Results

4. Innovative Product

4.1 Demand forecasting on Package, Manufacturer, Category in the Northern region

Data Preparation

Prophet TimeSeries Modeling

Results

4.2 Demand forecasting on Package, Manufacturer, Category in the Southern region

Data Preparation

Prophet TimeSeries Modeling

Results

4.3 Demand forecasting on Category, Non-Manufacturer, and Package in North Region

Data Preparation

Prophet Time Series Model

Exponential Smoothing

Results

4.4 Demand forecasting on Category, Non-Manufacturer, and Package in Southern Region

Data Preparation

Prophet Time Series Model

Exponential Smoothing

Results

5. Innovative Product

5.1 Demand forecasting on Category, Manufacturer and Caloric Segment

Data Preparation

Exponential Smoothing Modeling

Results

5.2 Demand forecasting on Flavor, Non-Manufacturer, Caloric Segment

Data Preparation

Exponential Smoothing Modeling

Results

5.3 Demand forecasting based on Package, Manufacturer, Caloric Segment and Brand

Data Preparation

Exponential Smoothing Modeling

Results

6. Innovative Product

6.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand

Data Preparation

Prophet Timeseries Modeling

Results

6.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category

Data Preparation

Exponential Smoothing Modeling

Results

7. Innovative Product

7.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand

Data Preparation

Prophet Timeseries Modeling

Results

7.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category

Data Preparation

Exponential Smoothing Modeling

Results

Conclusion

Group Contribution

Introduction - Business Problem Statement ¶

image.png

Swire Coca-Cola, USA is responsible for the production, sale, and distribution of Coca-Cola and various beverages across 13 states in the American West. The company is committed to continuously introducing innovative products into the market. Swire aims to enhance its production planning and management specifically for these products. Forecasting the demand for each innovative product listed so that this guarantees efficient resource utilization.

The analytic approach we used for the modeling is:

Identify regular products that closely resemble the specified innovative products and forecast sales by leveraging the sales data of these similar products. Determine the most relevant similar products based on factors such as brand, market category, manufacturer, package type, and/or flavor, matching the specifications of the specified innovative products. Analyze the weekly sales figures of these similar products. Aggregate the sales data of these products to predict the sales of the innovative products.

In this notebook, the modeling provides valuable forecast into the sales trends of products across various sub-segments and segment combinations. Additionally, we analyze demographic data alongside product segmentation. The integration of Python, SQL using Google Big Query, and Tableau provides a insightful analysis, which further can be used for modeling analysis.

Importing Libraries ¶

In [ ]:
#Importing Libraries
!pip install numpy pandas matplotlib statsmodels prophet tensorflow
!pip install prophet
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

1. Innovative product ¶

Item Description: Diet Smash Plum 11Small 4One
Caloric Segment: Diet
Market Category: SSD
Manufacturer: Swire-CC
Brand: Diet Smash
Package Type: 11Small 4One
Flavor: ‘Plum’

Which 13 weeks of the year would this product perform best in the market?
What is the forecasted demand, in weeks, for those 13 weeks?

1.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand ¶

We try filters with the category 'SSD' with Swire - CC, brand 'Diet Smash' and Diet/Light Caloric segment.

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Package Type: '11Small 4One' and Falavor: 'Plum'. So, we fist consider the other Caloric Segment: Diet, Market Category: SSD, Manufacturer: Swire-CC, and Brand: Diet Smash.

In [ ]:
# Required Authentications for the big query.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_41b06c35_18e92c55ff5') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
AND CATEGORY = 'SSD'
AND BRAND = 'DIET SMASH'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_41b06c35_18e92c55ff5') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-10-16 4629.0 13467.15
1 2021-04-24 2061.0 3308.90
2 2022-01-01 4601.0 9645.27
3 2021-04-03 2291.0 3620.47
4 2021-11-06 4228.0 9983.71
... ... ... ...
142 2021-11-27 4729.0 10618.19
143 2022-01-29 4442.0 14286.74
144 2021-01-23 2684.0 5300.92
145 2022-11-19 1747.0 10371.57
146 2022-07-23 2136.0 11672.57

147 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-10-16 4629.0 13467.15 2021 10 41
1 2021-04-24 2061.0 3308.90 2021 4 16
2 2022-01-01 4601.0 9645.27 2022 1 52
3 2021-04-03 2291.0 3620.47 2021 4 13
4 2021-11-06 4228.0 9983.71 2021 11 44
... ... ... ... ... ... ...
142 2021-11-27 4729.0 10618.19 2021 11 47
143 2022-01-29 4442.0 14286.74 2022 1 4
144 2021-01-23 2684.0 5300.92 2021 1 3
145 2022-11-19 1747.0 10371.57 2022 11 46
146 2022-07-23 2136.0 11672.57 2022 7 29

147 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Exponential Smoothing Model ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensuring the DATE column is in datetime format and setting as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
last_date = forecast_features.index.max()

# Preparing the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for the UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for the DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find out the best 13 weeks
def find_best_13_weeks(forecast):
    # Define rolling sum over a window of 13 weeks
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks the UNIT_SALES
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for the DOLLAR_SALES
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function with adjustment for negative values
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))

    # Ensure no negative values in the forecast
    forecast_positive = forecast.clip(lower=0)

    plt.plot(forecast_positive.index, forecast_positive, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plotting the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

From the plot we can see that the best 13 weeks are from the November to January months for unit and June to September for dollar sales.

In [ ]:
# Defining the function to find out the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Finding the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Finding the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'  # Here we are assuming forecasts start on Sundays
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Printing out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 64186.768234957635
Best 13 weeks for dollar sales start on 2024-06-16 and end on 2024-09-08, with total sales: 580059.770161476
Best 13 weeks for Unit Sales:
2023-11-05    5687.332771
2023-11-12    5692.474623
2023-11-19    5510.433734
2023-11-26    5385.756328
2023-12-03    5346.653107
2023-12-10    4906.951640
2023-12-17    4634.071058
2023-12-24    4635.620448
2023-12-31    4851.530769
2024-01-07    4798.997773
2024-01-14    4211.564092
2024-01-21    3896.723106
2024-01-28    4628.658785
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-06-16    42130.221537
2024-06-23    42156.532409
2024-06-30    44000.197305
2024-07-07    43192.075601
2024-07-14    43080.532361
2024-07-21    45498.123601
2024-07-28    46176.066906
2024-08-04    48598.681192
2024-08-11    48579.299690
2024-08-18    44776.838333
2024-08-25    45696.473122
2024-09-01    43551.991290
2024-09-08    42622.736814
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['June', 'July', 'August', 'September'], dtype='object')

The total sales of these products in these 13 weeks are 64186. And the dollar sales are 580059.

Let's evaluate the performance of the model using Mean Absolute Error (MAE) and Mean Squared Error (MSE).

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Defining the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fitting the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generating forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeating the process for the DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
In [ ]:
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4169.322570662927, MSE: 21871252.498852786
DOLLAR_SALES - MAE: 9752.579068279909, MSE: 123352246.47961982

We can see that the MAE for the unit sales model is 4169 and for the dollar sales is 9752, which is quite high.

So, let's analyze some other model and decide which model is the best.

Prophet Time Series Model ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['UNIT_SALES']].reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DOLLAR_SALES']].reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fitting the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fitting the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Creating a future dataframe for one year and make predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)

# Function to find out the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    forecast['rolling_sum'] = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_period_idx = forecast['rolling_sum'].idxmax()
    best_period_start = forecast.iloc[best_period_idx - 91//2]['ds']
    best_period_end = forecast.iloc[best_period_idx + 91//2]['ds']
    return best_period_start, best_period_end

# Finding the best 13 weeks for the UNIT_SALES and DOLLAR_SALES
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plotting the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/_9ytmngt.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/_4keyjxu.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=29624', 'data', 'file=/tmp/tmpu6u1ud2o/_9ytmngt.json', 'init=/tmp/tmpu6u1ud2o/_4keyjxu.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model32p_r4u3/prophet_model-20240331082642.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:26:42 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:26:42 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/0o_jy1wr.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/megbq1dw.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=3164', 'data', 'file=/tmp/tmpu6u1ud2o/0o_jy1wr.json', 'init=/tmp/tmpu6u1ud2o/megbq1dw.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model9b5f7r5r/prophet_model-20240331082643.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:26:43 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:26:43 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales are July to September and for dollar sales August to October.

Now let's evaluate the model performance using the MAE and MSE.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fitting the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])
# Generating forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeating the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/ld9evsuv.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/4m_5uxui.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=91717', 'data', 'file=/tmp/tmpu6u1ud2o/ld9evsuv.json', 'init=/tmp/tmpu6u1ud2o/4m_5uxui.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelh2uop7_d/prophet_model-20240331082700.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:27:00 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:27:00 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/f52nmv8j.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/fam67dcg.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=54608', 'data', 'file=/tmp/tmpu6u1ud2o/f52nmv8j.json', 'init=/tmp/tmpu6u1ud2o/fam67dcg.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modeldtg4ehpk/prophet_model-20240331082700.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:27:00 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:27:00 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 2583.4325462688034, MSE: 9513128.831265468
DOLLAR_SALES - MAE: 26826.278910410318, MSE: 815968070.3461262

Now the MAE and MSE values are reduced when compared to the exponential smoothing model.

Let's also try with the SARIMA time series model.

SARIMA Time Series Model ¶

SARIMAX stands for Seasonal Autoregressive Integrated Moving Average with exogenous variables. It's a time series model that can handle external effects. SARIMAX is an extension of the ARIMA model, which is made up of two parts: the autoregressive term (AR) and the moving-average term (MA).

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Sorting the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)

# Defining the SARIMA model for UNIT_SALES
sarima_model_unit = SARIMAX(forecast_features['UNIT_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_unit = sarima_model_unit.fit()

# Defining the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()

# Defining the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Forecasting the next 52 periods (assuming weekly data)
sarima_forecast_unit = sarima_result_unit.get_forecast(steps=52).predicted_mean
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean

# Converting forecasts to pandas Series with a DateTimeIndex
sarima_forecast_unit = pd.Series(sarima_forecast_unit.values, index=forecast_dates)
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)

# Checking if rolling sum calculation is possible
rolling_sum = sarima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()

if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plotting SARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(sarima_forecast_unit.index, sarima_forecast_unit, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

From the plot we can say that the best 13 weeks for the unit sales are from August to October.

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Defining the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()

# Forecasting the next 52 periods (assuming weekly data)
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean

# Converting forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)

# Calculating the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = sarima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()

if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plotting the SARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(sarima_forecast_dollar.index, sarima_forecast_dollar, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is ther")

From the plot we can say that the best 13 weeks for the dollar sales are from August to October.

In [ ]:
# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value

# Calculating for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(sarima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")

# Calculating for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(sarima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2024-07-21 to 2024-10-13, Total Sales: 98739.01788092876
Best 13 Weeks for Dollar Sales: 2024-07-28 to 2024-10-20, Total Sales: 737794.6046762894

The total sales for these best 13 weeks period is 98740 and total revenue is 737794 dollars.

Let's look at the performace of the model.

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Function for Walk Forward Validation for time series
def sarima_walk_forward_validation(data, order, sorder, start_train, end_test, step=1):
    history = data[:start_train].tolist()
    predictions = []
    actual = []
    # Walk forward over time steps in test
    for i in range(start_train, end_test, step):
        model = SARIMAX(history, order=order, seasonal_order=sorder, enforce_stationarity=False, enforce_invertibility=False)
        model_fit = model.fit(disp=False)
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        actual.append(data[i])
        history.append(data[i])
    mse = mean_squared_error(actual, predictions)
    mae = mean_absolute_error(actual, predictions)
    return mse, mae, predictions

order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)

# Adjusting these values based on the size of the dataset
start_train = int(len(forecast_features) * 0.7)  # Starting the training with 70% of the dataset
end_test = len(forecast_features)

unit_sales_data = forecast_features['UNIT_SALES']
dollar_sales_data = forecast_features['DOLLAR_SALES']

mse_unit, mae_unit, predictions_unit = sarima_walk_forward_validation(unit_sales_data, order, seasonal_order, start_train, end_test)
mse_dollar, mae_dollar, predictions_dollar = sarima_walk_forward_validation(dollar_sales_data, order, seasonal_order, start_train, end_test)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
  warn('Too few observations to estimate starting parameters%s.'
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
  warn('Too few observations to estimate starting parameters%s.'
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
UNIT_SALES: MSE=12138087.243539423, MAE=1080.6169480631572
DOLLAR_SALES: MSE=14950588.254797772, MAE=2971.096624988875
In [ ]:
print(f"UNIT_SALES: MSE={mse_unit}, MAE={mae_unit}")
print(f"DOLLAR_SALES: MSE={mse_dollar}, MAE={mae_dollar}")
UNIT_SALES: MSE=9513128.831265468, MAE=2583.4325462688034
DOLLAR_SALES: MSE=815968070.3461262, MAE=26826.278910410318

The MSE and MAE values decreased when compared to the other prophet and exponential smoothing models.

Results ¶

From the three models and their performance the best time for the sales are from August 21st to October 10th using the SARIMA model and their total sales are 98740 units and revenue is 737794 dollars.

1.2 Demand forecasting on Manufacturer, Caloric Segment and Flavor ¶

Now we try other filters with the flavour 'Plum' with Swire - CC and Diet/Light Caloric segment.

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Package Type: '11Small 4One'. So, we fist consider the other Caloric Segment: Diet, Market Category: SSD, Manufacturer: Swire-CC, and Flavor: 'Plum'

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_4684b0b3_18e96e150fb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE ITEM LIKE '%PLUM%'
AND MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_4684b0b3_18e96e150fb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2022-02-26 860.0 805.31
1 2021-12-25 1276.0 1191.99
2 2020-12-12 1625.0 1505.14
3 2021-06-26 1635.0 1547.28
4 2022-04-16 1168.0 1060.76
... ... ... ...
135 2021-07-03 1617.0 1524.87
136 2023-05-27 955.0 1078.13
137 2022-08-13 1209.0 1316.36
138 2023-03-25 960.0 1109.77
139 2022-12-17 881.0 983.93

140 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2022-02-26 860.0 805.31 2022 2 8
1 2021-12-25 1276.0 1191.99 2021 12 51
2 2020-12-12 1625.0 1505.14 2020 12 50
3 2021-06-26 1635.0 1547.28 2021 6 25
4 2022-04-16 1168.0 1060.76 2022 4 15
... ... ... ... ... ... ...
135 2021-07-03 1617.0 1524.87 2021 7 26
136 2023-05-27 955.0 1078.13 2023 5 21
137 2022-08-13 1209.0 1316.36 2022 8 32
138 2023-03-25 960.0 1109.77 2023 3 12
139 2022-12-17 881.0 983.93 2022 12 50

140 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Here we get 140 rows of the filtered data from the google big query with Year and Month and Week columns.

Now let's do the modeling.

Exponential Smoothing Modeling ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Defining the last date in the DataFrame
last_date = forecast_features.index.max()

# Preparing the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find out the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for the unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for the dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plotting the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

The best 13 weeks for the sales are November to January end form the plot and the sales drop very big during second half of the year.

In [ ]:
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Finding the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Finding the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'  # Assuming our forecasts start on Sundays
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 6506.903198230882
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 7724.323341631012
Best 13 weeks for Unit Sales:
2023-11-05    870.166374
2023-11-12    824.413819
2023-11-19    680.349468
2023-11-26    769.461487
2023-12-03    669.312831
2023-12-10    561.813252
2023-12-17    446.329587
2023-12-24    410.856229
2023-12-31    398.273982
2024-01-07    327.720387
2024-01-14    271.632940
2024-01-21    154.051637
2024-01-28    122.521204
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    871.962416
2023-11-12    880.221959
2023-11-19    741.583458
2023-11-26    884.959862
2023-12-03    773.559473
2023-12-10    683.116360
2023-12-17    570.747949
2023-12-24    527.911276
2023-12-31    550.818957
2024-01-07    366.850624
2024-01-14    327.699192
2024-01-21    256.345834
2024-01-28    288.545981
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

The best 13 weeks start from the November 5th to 28th January for both unit sales and Dollar sales. The overall unit sales in this best 13 weeks period is 6506 and revenue is 7724 dollars.

Let's evaluate the performance of the model.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Defining the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fitting the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generating forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeating the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 1037.5271164522487, MSE: 1516515.6249574588
DOLLAR_SALES - MAE: 951.8799506669977, MSE: 1313122.1411227155

Here, the MAE is 1037 for unit sales and 951 for dollar sales. The MSE is 1516515 for unit sales and 1313122 for dollar sales which is quite high.

Let's look at the prophet time series model.

Prophet Time Series Modeling ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/9um0mf3y.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/m04e5ozh.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=95592', 'data', 'file=/tmp/tmpu6u1ud2o/9um0mf3y.json', 'init=/tmp/tmpu6u1ud2o/m04e5ozh.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelho2pdmyy/prophet_model-20240331084344.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:43:44 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:43:44 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/pgmudq4i.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/92j2d2a1.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=38994', 'data', 'file=/tmp/tmpu6u1ud2o/pgmudq4i.json', 'init=/tmp/tmpu6u1ud2o/92j2d2a1.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modeljujtule_/prophet_model-20240331084344.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:43:44 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:43:44 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

The best 13 weeks for the product is August to November for both unit sales and dollar sales.

Let's evaluate the performance of the model.

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Now we calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeating the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/f75ygkv0.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/h76390om.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=54329', 'data', 'file=/tmp/tmpu6u1ud2o/f75ygkv0.json', 'init=/tmp/tmpu6u1ud2o/h76390om.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modellz9e0dp3/prophet_model-20240331084823.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:48:23 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:48:23 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/p2x9tl7m.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/0yumo_ly.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=26604', 'data', 'file=/tmp/tmpu6u1ud2o/p2x9tl7m.json', 'init=/tmp/tmpu6u1ud2o/0yumo_ly.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelr4tc7bdn/prophet_model-20240331084823.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:48:23 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:48:23 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 274.3332717188195, MSE: 93934.55080369273
DOLLAR_SALES - MAE: 375.8133333053362, MSE: 162370.21758420693

The MAE and MSE values are decreased when compared to the Exponential smoothing model.

Let's use the SARIMA model and evaluate it.

SARIMA Time Series Forecast Model ¶

SARIMAX stands for Seasonal Autoregressive Integrated Moving Average with exogenous variables. It's a time series model that can handle external effects. SARIMAX is an extension of the ARIMA model, which is made up of two parts: the autoregressive term (AR) and the moving-average term (MA).

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)

# Define the SARIMA model for UNIT_SALES
sarima_model_unit = SARIMAX(forecast_features['UNIT_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_unit = sarima_model_unit.fit()

# Define the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()

# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Forecast the next 52 periods (assuming weekly data)
sarima_forecast_unit = sarima_result_unit.get_forecast(steps=52).predicted_mean
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecasts to pandas Series with a DateTimeIndex
sarima_forecast_unit = pd.Series(sarima_forecast_unit.values, index=forecast_dates)
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)

# Check if rolling sum calculation is possible
rolling_sum = sarima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()

if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plotting SARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(sarima_forecast_unit.index, sarima_forecast_unit, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks of the model is from November to January for unit sales.

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt

# Define the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()

# Forecast the next 52 periods (assuming weekly data)
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)

# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = sarima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()

if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plot the SARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(sarima_forecast_dollar.index, sarima_forecast_dollar, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks of the model is from November to January for the dollar sales.

In [ ]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Function for Walk Forward Validation for time series
def sarima_walk_forward_validation(data, order, sorder, start_train, end_test, step=1):
    history = data[:start_train].tolist()
    predictions = []
    actual = []
    # Walk forward over time steps in test
    for i in range(start_train, end_test, step):
        model = SARIMAX(history, order=order, seasonal_order=sorder, enforce_stationarity=False, enforce_invertibility=False)
        model_fit = model.fit(disp=False)
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        actual.append(data[i])
        history.append(data[i]) # observation
    mse = mean_squared_error(actual, predictions)
    mae = mean_absolute_error(actual, predictions)
    return mse, mae, predictions

order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)

# Adjusting these values based on the size of your dataset
start_train = int(len(forecast_features) * 0.7)
end_test = len(forecast_features)

unit_sales_data = forecast_features['UNIT_SALES']
dollar_sales_data = forecast_features['DOLLAR_SALES']

mse_unit, mae_unit, predictions_unit = sarima_walk_forward_validation(unit_sales_data, order, seasonal_order, start_train, end_test)
mse_dollar, mae_dollar, predictions_dollar = sarima_walk_forward_validation(dollar_sales_data, order, seasonal_order, start_train, end_test)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
  warn('Too few observations to estimate starting parameters%s.'
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
  warn('Too few observations to estimate starting parameters%s.'
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
  np.inner(score_obs, score_obs) /
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [ ]:
print(f"UNIT_SALES: MSE={mse_unit}, MAE={mae_unit}")
print(f"DOLLAR_SALES: MSE={mse_dollar}, MAE={mae_dollar}")
UNIT_SALES: MSE=419357.07149860164, MAE=521.6442435461662
DOLLAR_SALES: MSE=277010.1154416626, MAE=398.3749394864194

The MAE and MSE values for the model are 521 and 419357 respectively for unit sales values and 398 and 277010 for dollar sales.

Results ¶

From the models we can see that the model provides with the is the best model with the best solutions and the months from November to January has the best 13 weeks.

2. Innovative Product ¶

Item Description: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Venomous Blast
Package Type: 16 Liquid Small
Flavor: ’Kiwano’

Which 13 weeks of the year would this product perform best in the market?
What is the forecasted demand, in weeks, for those 13 weeks?

2.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand¶

We try filters with the category 'Energy' with Swire - CC, brand 'Venomous Blast' and Diet/Light Caloric segment.

Data Preparation¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Package Type: '16 liquid small'. So, we fist consider the other Caloric Segment: Diet, Market Category: energy, Manufacturer: Swire-CC, and Brand: venomous blast.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_42824de3_18e9731e74d') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
AND CATEGORY = 'ENERGY'
AND BRAND = 'VENOMOUS BLAST'
GROUP BY DATE;
In [ ]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_42824de3_18e9731e74d') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2022-08-20 3060.0 3343.27
1 2021-03-20 4584.0 4264.95
2 2021-10-30 3433.0 3102.08
3 2021-08-21 4333.0 4032.92
4 2021-03-27 3935.0 3694.53
... ... ... ...
134 2022-06-18 3231.0 3555.89
135 2021-07-10 3985.0 3689.87
136 2022-10-08 2869.0 3123.34
137 2022-09-24 2594.0 2842.74
138 2023-07-29 1767.0 1896.93

139 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracing relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2022-08-20 3060.0 3343.27 2022 8 33
1 2021-03-20 4584.0 4264.95 2021 3 11
2 2021-10-30 3433.0 3102.08 2021 10 43
3 2021-08-21 4333.0 4032.92 2021 8 33
4 2021-03-27 3935.0 3694.53 2021 3 12
... ... ... ... ... ... ...
134 2022-06-18 3231.0 3555.89 2022 6 24
135 2021-07-10 3985.0 3689.87 2021 7 27
136 2022-10-08 2869.0 3123.34 2022 10 40
137 2022-09-24 2594.0 2842.74 2022 9 38
138 2023-07-29 1767.0 1896.93 2023 7 30

139 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Prophet Time Series Modeling¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/0wtg_e7n.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/xlpmffl1.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=94615', 'data', 'file=/tmp/tmpshbwn_60/0wtg_e7n.json', 'init=/tmp/tmpshbwn_60/xlpmffl1.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelt5dp6tye/prophet_model-20240401010736.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:07:36 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:07:37 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/b417jcdc.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/5y7q7vwo.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=34443', 'data', 'file=/tmp/tmpshbwn_60/b417jcdc.json', 'init=/tmp/tmpshbwn_60/5y7q7vwo.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelymu7k9b6/prophet_model-20240401010737.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:07:37 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:07:37 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
<ipython-input-61-cb9e58d8778c>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
<ipython-input-61-cb9e58d8778c>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()

In the week of the months the of the November to January, the unit sales and dollar sales are the highest.

Lets evaluate model performance metrics

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/kk3zzpc6.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/s7bg3z4e.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=13272', 'data', 'file=/tmp/tmpshbwn_60/kk3zzpc6.json', 'init=/tmp/tmpshbwn_60/s7bg3z4e.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelfd5t083b/prophet_model-20240401010821.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:08:21 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:08:21 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/1zat6fjg.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/zkeo8nmu.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=3303', 'data', 'file=/tmp/tmpshbwn_60/1zat6fjg.json', 'init=/tmp/tmpshbwn_60/zkeo8nmu.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_model9jdle0gm/prophet_model-20240401010822.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:08:22 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:08:22 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 184.12940565230494, MSE: 53759.618137118676
DOLLAR_SALES - MAE: 226.95544492763685, MSE: 91284.82288173026

The MAE and MSE values for unit sales are 184 and 53759.For dollor sales the respected values are 226 and 91284.

Exponential Smoothing Modeling¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

Here also the best sales are from the November to January for both dollar and unit sales.

In [ ]:
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 27089.59739698417
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 25716.84464484714
Best 13 weeks for Unit Sales:
2023-11-05    2238.295810
2023-11-12    2374.804756
2023-11-19    2197.131115
2023-11-26    2089.043340
2023-12-03    2363.403338
2023-12-10    2428.576655
2023-12-17    2156.636819
2023-12-24    2098.744240
2023-12-31    1857.491347
2024-01-07    2357.688862
2024-01-14    2025.345332
2024-01-21    1532.255441
2024-01-28    1370.180342
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    2237.709606
2023-11-12    2322.717063
2023-11-19    2262.854639
2023-11-26    2060.134314
2023-12-03    2302.053302
2023-12-10    2361.575892
2023-12-17    2001.013098
2023-12-24    1908.227347
2023-12-31    1753.311512
2024-01-07    2080.857676
2024-01-14    1856.926463
2024-01-21    1386.450884
2024-01-28    1183.012849
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

The total sales in this 13 weeks are 27089 units and dollar sales are 25716 dollars.

Let's evaluate the model performance.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 239.7690087025342, MSE: 88172.0705520971
DOLLAR_SALES - MAE: 345.0714391443558, MSE: 184260.08837798057

The MAE and MSE values are 239 and 88172 for dollar sales and 345 and 184260 for the dollar sales.

ARIMA Modeling¶

ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)

# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()

if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in ARIMA model is November to January have the best unit sales.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()

if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in ARIMA model is November to January have the best dollar sales.

In [ ]:
import pandas as pd

# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value

# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")

# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-10-29 to 2024-01-21, Total Sales: 24357.8833125924
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 27014.166682422216
In [ ]:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']

# Number of observations to leave out in each split for testing
n_splits = 5

# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)

# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()

    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])

    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse

# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)

# Perform rolling forecast for DOLLAR_SALES
predictions_unit, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)

# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 159.96594909611218
ARIMA model MAE for DOLLAR_SALES: 154.39527519479262
ARIMA model MSE for UNIT_SALES: 41398.72850037805
ARIMA model MSE for DOLLAR_SALES: 41749.16367557272

The MAE values have decreased when compared to the other models. So, this is quite the good model.

Results¶

From the models we can say that best 13 weeks are November to the January for the both dollar to unit sales for all the models used. And the best values are on 2023-11-05 and end on 2024-01-28 with 27089 unit sales and Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with dollar sales: 25716.

Since, we don't have combinations of flavour 'Kiwano' with brand of 'Venomous Blast', so now we use the sales of flavour without the brand.

2.2 Demand forecasting on Flavor, Manufacturer, Category, Caloric Segment¶

We will now try filters with the flavor 'Kiwano' with Swire - CC, Category 'Energy' and Diet/Light Caloric segment.

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Package Type: '16 liquid small'. So, we fist consider the other Caloric Segment: Diet, Market Category: energy, Manufacturer: Swire-CC, and flavour : kiwano

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_47bb0cd1_18e9734ee35') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE ITEM LIKE '%KIWANO%'
AND MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
AND CATEGORY = 'ENERGY'
GROUP BY DATE;
In [ ]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_47bb0cd1_18e9734ee35') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2022-07-09 559.0 592.38
1 2023-02-11 620.0 637.37
2 2023-03-11 453.0 482.55
3 2023-02-04 399.0 413.73
4 2023-10-28 413.0 422.02
... ... ... ...
134 2021-09-11 805.0 703.29
135 2021-05-29 635.0 575.11
136 2023-04-08 433.0 468.86
137 2021-10-30 612.0 544.15
138 2021-02-13 568.0 520.90

139 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2022-07-09 559.0 592.38 2022 7 27
1 2023-02-11 620.0 637.37 2023 2 6
2 2023-03-11 453.0 482.55 2023 3 10
3 2023-02-04 399.0 413.73 2023 2 5
4 2023-10-28 413.0 422.02 2023 10 43
... ... ... ... ... ... ...
134 2021-09-11 805.0 703.29 2021 9 36
135 2021-05-29 635.0 575.11 2021 5 21
136 2023-04-08 433.0 468.86 2023 4 14
137 2021-10-30 612.0 544.15 2021 10 43
138 2021-02-13 568.0 520.90 2021 2 6

139 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Exponential Smoothing Model ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

Here the best sales are from the November to January for both dollar and unit sales.

In [ ]:
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

# Since 'forecast_index' doesn't have the frequency set, let's define it to ensure we can perform the rolling operation.
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 8614.014741194007
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 8415.144403697363
Best 13 weeks for Unit Sales:
2023-11-05    542.921586
2023-11-12    633.432077
2023-11-19    614.065669
2023-11-26    572.752860
2023-12-03    733.184863
2023-12-10    769.885958
2023-12-17    651.010705
2023-12-24    738.873109
2023-12-31    706.283460
2024-01-07    800.169949
2024-01-14    799.041546
2024-01-21    496.021773
2024-01-28    556.371186
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    556.013072
2023-11-12    613.811228
2023-11-19    613.488906
2023-11-26    573.750007
2023-12-03    713.214290
2023-12-10    753.815524
2023-12-17    645.691097
2023-12-24    701.922308
2023-12-31    682.957786
2024-01-07    774.060050
2024-01-14    755.137250
2024-01-21    488.390387
2024-01-28    542.892500
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

The total sales in this 13 weeks are 8614 units and dollar sales are 8415 dollars.

Let's evaluate the model performance.

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 78.47035335943062, MSE: 8222.930070753187
DOLLAR_SALES - MAE: 88.19229715697364, MSE: 10195.112930177784

The MAE and MSE values are 88 and 10195 for dollar sales and 78 and 8222 for the unit sales.

Prophet Model ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/hbz3mz33.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/n0u_mp2a.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=86890', 'data', 'file=/tmp/tmpshbwn_60/hbz3mz33.json', 'init=/tmp/tmpshbwn_60/n0u_mp2a.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modeli7v1eqir/prophet_model-20240401011110.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:11:10 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:11:10 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/6f1w19i8.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/cq8vg3ym.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=25815', 'data', 'file=/tmp/tmpshbwn_60/6f1w19i8.json', 'init=/tmp/tmpshbwn_60/cq8vg3ym.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelooi9inad/prophet_model-20240401011110.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:11:10 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:11:10 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
<ipython-input-67-159f9d2a9899>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
<ipython-input-67-159f9d2a9899>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()

In the week of the months the of the october to december, the unit sales and dollar sales are the highest.

Lets evaluate model performance metrics

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/nfedlo7e.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/fzmm9x9w.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=20188', 'data', 'file=/tmp/tmpshbwn_60/nfedlo7e.json', 'init=/tmp/tmpshbwn_60/fzmm9x9w.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelmnfy9409/prophet_model-20240401011221.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:12:21 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:12:21 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/c39bwxu0.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/h2uqdf4q.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=42207', 'data', 'file=/tmp/tmpshbwn_60/c39bwxu0.json', 'init=/tmp/tmpshbwn_60/h2uqdf4q.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelkd1y9xxn/prophet_model-20240401011221.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:12:21 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:12:21 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 51.40720349079181, MSE: 4829.472102122382
DOLLAR_SALES - MAE: 62.38062539059109, MSE: 7005.025188863927

The MAE and MSE values for unit sales are 51 and 4829.For dollor sales the respected values are 62 and 7005

ARIMA Model ¶

ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)

# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()

if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in ARIMA model is November to January have the best unit sales.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()

if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in ARIMA model is November to January have the best dollar sales.

In [ ]:
import pandas as pd

# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value

# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")

# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-11-05 to 2024-01-28, Total Sales: 6254.373421893271
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 8603.569286046462
In [ ]:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']

# Number of observations to leave out in each split for testing
n_splits = 5

# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)

# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()

    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])

    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse

# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)

# Perform rolling forecast for DOLLAR_SALES
predictions_unit, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)

# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 60.95541428597327
ARIMA model MAE for DOLLAR_SALES: 66.25876265212688
ARIMA model MSE for UNIT_SALES: 3930.1638823873313
ARIMA model MSE for DOLLAR_SALES: 5237.383369388842

The MAE values have decreased when compared to the other models. So, this is quite the good model.

Results ¶

From the model the best model is ARIMA with low MAE and MSE values, we can say that the best 13 Weeks for unit sales are from 2023-11-05 to 2024-01-28 with total sales: 6254 and the best 13 Weeks for dollar sales are from 2023-10-29 to 2024-01-21 with total dollar sales: 8603.

Next, we analyze the sales of Non swire manufactures instead of the swire manufactures.

2.3 Demand forecasting on Flavor, Non-Swire Manufacturer, Category, and Caloric Segment¶

We will now try with filters on flavor 'Kiwano' with non-swire manufacturer, category 'Energy' and 'Diet/Light' Caloric Segment.

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Package Type: '16 liquid small'. So, we fist consider the other Caloric Segment: Diet, Market Category: energy, Manufacturer is not Swire-CC, and flavour : kiwano

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_406e83f8_18e97393a1a') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE ITEM LIKE '%KIWANO%'
AND MANUFACTURER != 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
AND CATEGORY = 'ENERGY'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_406e83f8_18e97393a1a') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-09-11 123092.00 257065.41
1 2022-01-08 103403.00 221574.93
2 2022-08-06 111003.00 256998.64
3 2023-02-25 77950.00 189635.07
4 2022-08-20 104504.00 256971.18
... ... ... ...
134 2023-02-18 81871.00 201203.89
135 2022-06-25 110180.00 254196.28
136 2021-10-23 120394.00 242164.06
137 2022-07-09 122341.00 259899.63
138 2023-10-28 71646.85 164146.62

139 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd

# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-09-11 123092.00 257065.41 2021 9 36
1 2022-01-08 103403.00 221574.93 2022 1 1
2 2022-08-06 111003.00 256998.64 2022 8 31
3 2023-02-25 77950.00 189635.07 2023 2 8
4 2022-08-20 104504.00 256971.18 2022 8 33
... ... ... ... ... ... ...
134 2023-02-18 81871.00 201203.89 2023 2 7
135 2022-06-25 110180.00 254196.28 2022 6 25
136 2021-10-23 120394.00 242164.06 2021 10 42
137 2022-07-09 122341.00 259899.63 2022 7 27
138 2023-10-28 71646.85 164146.62 2023 10 43

139 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Exponential Smoothing Modeling ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

Here the best sales are from the November to January for both dollar and unit sales.

In [ ]:
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

# Since 'forecast_index' doesn't have the frequency set, let's define it to ensure we can perform the rolling operation.
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 640494.8662601131
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 1313210.6307216804
Best 13 weeks for Unit Sales:
2023-11-05    69740.235538
2023-11-12    66293.906130
2023-11-19    52157.082579
2023-11-26    43255.367610
2023-12-03    45781.059616
2023-12-10    49516.423206
2023-12-17    51097.188408
2023-12-24    50896.884348
2023-12-31    47854.218585
2024-01-07    48371.876137
2024-01-14    36942.797598
2024-01-21    40319.882426
2024-01-28    38267.944079
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    157196.628812
2023-11-12    149462.811153
2023-11-19    116709.370115
2023-11-26     98405.742345
2023-12-03     96340.505357
2023-12-10    101052.849146
2023-12-17    106224.644560
2023-12-24    100832.364426
2023-12-31     92506.248174
2024-01-07     90096.539788
2024-01-14     72240.792266
2024-01-21     69553.290389
2024-01-28     62588.844191
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

The total sales in this 13 weeks are 640494 units and dollar sales are 1313210 dollars.

Let's evaluate the model performance.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 22372.4380005534, MSE: 672118208.9165068
DOLLAR_SALES - MAE: 68485.10466434849, MSE: 5900371668.366554

The MAE and MSE values are 68485 and 5900371668 for dollar sales and 22372 and 672118208 for the unit sales.

Prophet Time Series Modeling¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/5tbswkao.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/ss_ra2dt.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=30402', 'data', 'file=/tmp/tmpshbwn_60/5tbswkao.json', 'init=/tmp/tmpshbwn_60/ss_ra2dt.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modeln6c70_2i/prophet_model-20240401011549.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:15:49 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:15:49 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/kjshcr6a.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/nrlfjbuz.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=47913', 'data', 'file=/tmp/tmpshbwn_60/kjshcr6a.json', 'init=/tmp/tmpshbwn_60/nrlfjbuz.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelcectwu2v/prophet_model-20240401011549.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:15:49 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:15:49 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
<ipython-input-76-776aca8e2f12>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
<ipython-input-76-776aca8e2f12>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()

The best 13 weeks in which the unit sales and dollar sales are the highest in the months from november to january.

Lets evaluate model performance metrics

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/u9pjjil2.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/n580nfkw.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=37008', 'data', 'file=/tmp/tmpshbwn_60/u9pjjil2.json', 'init=/tmp/tmpshbwn_60/n580nfkw.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modeltxjq2ubn/prophet_model-20240401011555.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:15:55 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:15:55 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/s50mji75.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/ehw7alic.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=14274', 'data', 'file=/tmp/tmpshbwn_60/s50mji75.json', 'init=/tmp/tmpshbwn_60/ehw7alic.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelo2b1sxuj/prophet_model-20240401011555.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:15:55 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:15:55 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 3904.5456265075886, MSE: 22270963.418062758
DOLLAR_SALES - MAE: 107859.06277483127, MSE: 11735122700.572842

The MAE and MSE values for unit sales are 3904 and 22270963 . For dollor sales the respected values are 107859 and 11735122700.

ARIMA Modeling¶

ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)

# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()

if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in which the unit sales are the highest in the months from november to january.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()

# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean

# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)

# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()

if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period is there")

The best 13 weeks in which the dollar sales are the highest in the months from november to january.

In [ ]:
import pandas as pd

# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value

# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")

# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-10-29 to 2024-01-21, Total Sales: 935832.507085375
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 2148825.242396435
In [ ]:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']

# Number of observations to leave out in each split for testing
n_splits = 5

# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)

# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()

    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])

    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse

# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)

# Perform rolling forecast for DOLLAR_SALES
predictions_unit, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)

# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 4932.447870350076
ARIMA model MAE for DOLLAR_SALES: 5904.8374217014525
ARIMA model MSE for UNIT_SALES: 32051026.15001712
ARIMA model MSE for DOLLAR_SALES: 60474279.28900906

The MAE values have decreased for both unit sales and dollor sales when compared to the other models. So, this is quite the good model.

Results ¶

From the models we can say that the best model is ARIMA model with best 13 weeks for unit sales from 2023-10-29 to 2024-01-21 with total sales: 935832. and best 13 Weeks for dollar sales from 2023-10-29 to 2024-01-21 with total dollar sales of 2148825. All the models gives the best 13 weeks are from November to January.

3. Innovative product ¶

Item Description: Peppy Gentle Drink Pink Woodsy .5L Multi Jug
Caloric Segment: Regular
Type: SSD
Manufacturer: Swire-CC
Brand: Peppy
Package Type: .5L Multi Jug
Flavor: ‘Pink Woodsy’

Swire plans to release this product in the Southern region for 13 weeks.
What will the forecasted demand be, in weeks, for this product?

3.1 Demand forecasting on Brand, Manufacturer, Category, Caloric Segment in Southern Regions¶

We first filter the brand 'Peppy' with 'Swire-cc', category 'SSD' and 'Regular' caloric segment in Southern Regions like KS, UT, CA, CO, AZ, NM, NV

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Pink Woodsy'. So, we fist consider the other Caloric Segment: Regular, Market Category: SSD, Manufacturer: Swire-CC, and Brand:Peppy.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_2dbeaf40_18e972e90ec') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'REGULAR'
    AND fmd.CATEGORY = 'SSD'
    AND fmd.BRAND = 'PEPPY'
    AND fmd.MANUFACTURER = 'SWIRE-CC'
GROUP BY 
    fmd.DATE;
In [ ]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_2dbeaf40_18e972e90ec') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2023-01-14 1406159.0 6063254.22
1 2023-06-17 1415479.0 6217725.66
2 2021-09-18 1504042.0 5152927.93
3 2021-06-05 1593802.0 5142980.14
4 2021-11-20 1531359.0 5326959.38
... ... ... ...
142 2021-04-17 1425183.0 4636863.16
143 2021-04-10 1565628.0 4975948.41
144 2021-01-23 1457032.0 4529955.38
145 2022-01-22 1411644.0 5062541.25
146 2022-12-31 1496171.0 6029405.19

147 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2023-01-14 1406159.0 6063254.22 2023 1 2
1 2023-06-17 1415479.0 6217725.66 2023 6 24
2 2021-09-18 1504042.0 5152927.93 2021 9 37
3 2021-06-05 1593802.0 5142980.14 2021 6 22
4 2021-11-20 1531359.0 5326959.38 2021 11 46
... ... ... ... ... ... ...
142 2021-04-17 1425183.0 4636863.16 2021 4 15
143 2021-04-10 1565628.0 4975948.41 2021 4 14
144 2021-01-23 1457032.0 4529955.38 2021 1 3
145 2022-01-22 1411644.0 5062541.25 2022 1 3
146 2022-12-31 1496171.0 6029405.19 2022 12 52

147 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Prophet TimeSeries Modeling ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/xwj2ohjc.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/20yokwn4.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=45361', 'data', 'file=/tmp/tmpshbwn_60/xwj2ohjc.json', 'init=/tmp/tmpshbwn_60/20yokwn4.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modeljo0tht3g/prophet_model-20240401010402.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:04:02 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:04:02 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/tl0v_pbb.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/gl708g9i.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=1343', 'data', 'file=/tmp/tmpshbwn_60/tl0v_pbb.json', 'init=/tmp/tmpshbwn_60/gl708g9i.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_model8sdz05iz/prophet_model-20240401010402.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:04:02 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:04:02 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
<ipython-input-52-159f9d2a9899>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
<ipython-input-52-159f9d2a9899>:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()

From this plot we can see that the best 13 weeks for unit sales are from january to march and dollor sales from october to december.

Now let's evaluate the model performance using the MAE and MSE

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/a8jgy03y.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/mtrne_97.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=22758', 'data', 'file=/tmp/tmpshbwn_60/a8jgy03y.json', 'init=/tmp/tmpshbwn_60/mtrne_97.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modell85eiptr/prophet_model-20240401010413.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:04:13 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:04:13 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/japwuo25.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/e6i2ij_4.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=45952', 'data', 'file=/tmp/tmpshbwn_60/japwuo25.json', 'init=/tmp/tmpshbwn_60/e6i2ij_4.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_model91ohkqn9/prophet_model-20240401010414.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
01:04:14 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:04:14 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828047
DOLLAR_SALES - MAE: 4927800.6610338185, MSE: 24365389117751.87

The MAE and MSE values for unit sales are 84142 and 11464253863.For dollor sales the respected values are 4927800 and 24365389117751.

Exponential Smoothing Model ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Defining the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

From the plot we can see that the best 13 weeks for unit sales are from november to February and dollor sales from august to october.

In [ ]:
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-12 and end on 2024-02-04, with total sales: 18799739.97899931
Best 13 weeks for dollar sales start on 2024-08-04 and end on 2024-10-27, with total sales: 88970078.00788526
Best 13 weeks for Unit Sales:
2023-11-12    1.429675e+06
2023-11-19    1.406215e+06
2023-11-26    1.337633e+06
2023-12-03    1.413769e+06
2023-12-10    1.413843e+06
2023-12-17    1.402463e+06
2023-12-24    1.442195e+06
2023-12-31    1.635844e+06
2024-01-07    1.344716e+06
2024-01-14    1.385444e+06
2024-01-21    1.440308e+06
2024-01-28    1.698508e+06
2024-02-04    1.449127e+06
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-08-04    6.950594e+06
2024-08-11    7.074054e+06
2024-08-18    6.635315e+06
2024-08-25    6.619353e+06
2024-09-01    6.610059e+06
2024-09-08    6.866880e+06
2024-09-15    6.991518e+06
2024-09-22    6.732203e+06
2024-09-29    6.720961e+06
2024-10-06    7.104676e+06
2024-10-13    7.116964e+06
2024-10-20    6.852696e+06
2024-10-27    6.694806e+06
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January', 'February'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['August', 'September', 'October'], dtype='object')

The total unit sales of these products in these 13 weeks are 18799739 and the dollar sales are 88970078.

Lets evaluate the model performance metrics

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 47282.503481257125, MSE: 3515332106.089926
DOLLAR_SALES - MAE: 213421.68449667966, MSE: 64885923395.17784

The MAE and MSE values for unit sales are 47282 and 3515332106 and the respected values for dollor sales are 213421 and 64885923395

The MAE values are decreased when compared to other models.So This is a quite good model.

Results ¶

The best 13 weeks for the products in the Southern region for unit sales start on 2023-11-12 and end on 2024-02-04, with total sales: 18799739. and the best 13 weeks for dollar sales start on 2024-08-04 and end on 2024-10-27, with total revenue: 88970078.

Next, we analyze the package type of .5L Multi Jug with the swire in southern region.

3.2 Demand forecasting on Package, Caloric Segment, Category and Manufacturer in Southern Regions ¶

We first filter the package '.5L Multi Jug' with 'Swire-cc', category 'SSD' and 'Regular' caloric segment in Southern Regions like KS, UT, CA, CO, AZ, NM, NV

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Pink Woodsy'. So, we fist consider the other Package : '.5L multijug' Caloric Segment: Regular, Market Category: SSD, Manufacturer: Swire-CC, and Brand:Peppy.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_4b74ac33_18e97221c14') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'REGULAR'
    AND fmd.CATEGORY = 'SSD'
    AND fmd.PACKAGE LIKE '%.5L MULTI JUG%'
    AND fmd.MANUFACTURER = 'SWIRE-CC'
GROUP BY DATE;
In [ ]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_4b74ac33_18e97221c14') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-03-20 1.0 1.29
1 2021-04-10 1.0 1.00
2 2022-01-01 1.0 1.79
3 2021-04-03 1.0 1.19
4 2021-05-15 1.0 1.00
5 2021-07-03 3.0 2.75
6 2022-05-28 1.0 1.25
7 2021-07-31 1.0 1.00
8 2021-07-10 1.0 1.00
9 2023-02-18 1.0 1.00
10 2021-06-26 1.0 1.00

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-03-20 1.0 1.29 2021 3 11
1 2021-04-10 1.0 1.00 2021 4 14
2 2022-01-01 1.0 1.79 2022 1 52
3 2021-04-03 1.0 1.19 2021 4 13
4 2021-05-15 1.0 1.00 2021 5 19
5 2021-07-03 3.0 2.75 2021 7 26
6 2022-05-28 1.0 1.25 2022 5 21
7 2021-07-31 1.0 1.00 2021 7 30
8 2021-07-10 1.0 1.00 2021 7 27
9 2023-02-18 1.0 1.00 2023 2 7
10 2021-06-26 1.0 1.00 2021 6 25

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Prophet TimeSeries Modeling ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['UNIT_SALES']].reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DOLLAR_SALES']].reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    forecast['rolling_sum'] = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_period_idx = forecast['rolling_sum'].idxmax()
    best_period_start = forecast.iloc[best_period_idx - 91//2]['ds']
    best_period_end = forecast.iloc[best_period_idx + 91//2]['ds']
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp49_aqdkp/bp83ieht.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp49_aqdkp/ioy8jjjq.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=96238', 'data', 'file=/tmp/tmp49_aqdkp/bp83ieht.json', 'init=/tmp/tmp49_aqdkp/ioy8jjjq.json', 'output', 'file=/tmp/tmp49_aqdkp/prophet_modeln9frmlrz/prophet_model-20240330170640.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
17:06:40 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
17:06:40 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp49_aqdkp/jwwc0lyh.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp49_aqdkp/u4i78zc8.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=88930', 'data', 'file=/tmp/tmp49_aqdkp/jwwc0lyh.json', 'init=/tmp/tmp49_aqdkp/u4i78zc8.json', 'output', 'file=/tmp/tmp49_aqdkp/prophet_model9fwrmrpe/prophet_model-20240330170641.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
17:06:41 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
17:06:41 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales are from june to august and dollor sales from april to june.

Lets evaluate model performance metrics

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 5.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/no2xspaq.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/f2_7pksv.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=84232', 'data', 'file=/tmp/tmpshbwn_60/no2xspaq.json', 'init=/tmp/tmpshbwn_60/f2_7pksv.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelj_h0aqbz/prophet_model-20240401005143.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
00:51:43 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
00:51:44 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 5.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/a7xs0kk5.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/tp_t4l07.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=20649', 'data', 'file=/tmp/tmpshbwn_60/a7xs0kk5.json', 'init=/tmp/tmpshbwn_60/tp_t4l07.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelya8_eqwd/prophet_model-20240401005145.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
00:51:45 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
00:51:46 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 29.616903467731532, MSE: 953.6583555096549
DOLLAR_SALES - MAE: 29.963570134398196, MSE: 975.9784331169152

The MAE and MSE values for unit sales are 29 and 953. For dollor sales the respected values are 29 and 975.

Results ¶

From this model we can see that the best 13 weeks for unit sales are from june to august and dollor sales from april to june.

Let's view the .5L Multi Jug with the non swire cc manufacturer.

3.3 Demand forecasting based on Package, Caloric Segment, Category and Non-Manufacturer¶

We first filter the package '.5L Multi Jug' with 'Non Swire-cc', category 'SSD' and 'Regular' caloric segment in Southern Regions like KS, UT, CA, CO, AZ, NM, NV

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Pink Woodsy'. So, we fist consider the other Package : '.5L multijug' Caloric Segment: Regular, Market Category: SSD, Manufacturer!= Swire-CC.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_49035d5_18e936b4f5e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM 
    `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    JOIN `swirecc.consumer_demographics` cd ON zm.ZIP_CODE = cd.Zip
    WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE 
    fmd.PACKAGE LIKE '%.5L MULTI JUG%'
    AND fmd.CALORIC_SEGMENT = 'REGULAR'
    AND fmd.CATEGORY = 'SSD'
    AND fmd.MANUFACTURER != 'SWIRE-CC'
GROUP BY 
    fmd.DATE;
In [ ]:
job = client.get_job('bquxjob_49035d5_18e936b4f5e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2023-10-21 1.0 1.50
1 2022-02-12 2.0 2.50
2 2023-04-08 3.0 4.50
3 2022-01-08 9.0 11.25
4 2021-10-30 1.0 1.25
5 2022-08-06 1.0 1.25
6 2021-03-20 1.0 1.79
7 2021-08-21 3.0 3.00
8 2022-04-23 2.0 2.50
9 2023-05-06 2.0 2.00
10 2022-07-30 1.0 1.25
11 2023-01-28 3.0 4.50
12 2021-10-16 3.0 3.00
13 2021-12-18 4.0 4.25
14 2023-10-07 3.0 4.00
15 2023-03-25 1.0 1.50
16 2022-12-17 1.0 1.50
17 2022-11-26 2.0 2.75
18 2023-03-04 5.0 5.00
19 2023-02-18 3.0 4.50
20 2022-07-02 3.0 3.75
21 2022-06-11 1.0 1.25
22 2023-08-26 1.0 1.50
23 2023-03-11 2.0 3.00
24 2023-02-04 1.0 1.50
25 2022-06-18 2.0 2.00
26 2023-08-05 4.0 5.50
27 2021-08-28 1.0 1.00
28 2023-07-08 1.0 1.50
29 2023-05-20 1.0 1.50
30 2022-09-17 3.0 3.75
31 2022-07-23 1.0 1.00
32 2022-04-09 1.0 1.25
33 2022-12-31 1.0 1.50
34 2023-07-01 1.0 1.50
35 2022-07-16 1.0 1.25
36 2022-03-26 3.0 3.75
37 2023-05-13 1.0 1.50
38 2022-08-27 1.0 1.25

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2023-10-21 1.0 1.50 2023 10 42
1 2022-02-12 2.0 2.50 2022 2 6
2 2023-04-08 3.0 4.50 2023 4 14
3 2022-01-08 9.0 11.25 2022 1 1
4 2021-10-30 1.0 1.25 2021 10 43
5 2022-08-06 1.0 1.25 2022 8 31
6 2021-03-20 1.0 1.79 2021 3 11
7 2021-08-21 3.0 3.00 2021 8 33
8 2022-04-23 2.0 2.50 2022 4 16
9 2023-05-06 2.0 2.00 2023 5 18
10 2022-07-30 1.0 1.25 2022 7 30
11 2023-01-28 3.0 4.50 2023 1 4
12 2021-10-16 3.0 3.00 2021 10 41
13 2021-12-18 4.0 4.25 2021 12 50
14 2023-10-07 3.0 4.00 2023 10 40
15 2023-03-25 1.0 1.50 2023 3 12
16 2022-12-17 1.0 1.50 2022 12 50
17 2022-11-26 2.0 2.75 2022 11 47
18 2023-03-04 5.0 5.00 2023 3 9
19 2023-02-18 3.0 4.50 2023 2 7
20 2022-07-02 3.0 3.75 2022 7 26
21 2022-06-11 1.0 1.25 2022 6 23
22 2023-08-26 1.0 1.50 2023 8 34
23 2023-03-11 2.0 3.00 2023 3 10
24 2023-02-04 1.0 1.50 2023 2 5
25 2022-06-18 2.0 2.00 2022 6 24
26 2023-08-05 4.0 5.50 2023 8 31
27 2021-08-28 1.0 1.00 2021 8 34
28 2023-07-08 1.0 1.50 2023 7 27
29 2023-05-20 1.0 1.50 2023 5 20
30 2022-09-17 3.0 3.75 2022 9 37
31 2022-07-23 1.0 1.00 2022 7 29
32 2022-04-09 1.0 1.25 2022 4 14
33 2022-12-31 1.0 1.50 2022 12 52
34 2023-07-01 1.0 1.50 2023 7 26
35 2022-07-16 1.0 1.25 2022 7 28
36 2022-03-26 3.0 3.75 2022 3 12
37 2023-05-13 1.0 1.50 2023 5 19
38 2022-08-27 1.0 1.25 2022 8 34

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook.

Prophet TimeSeries Modeling ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features = forecast_features.set_index('DATE').sort_index()

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['UNIT_SALES']].reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DOLLAR_SALES']].reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    forecast['rolling_sum'] = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_period_idx = forecast['rolling_sum'].idxmax()
    best_period_start = forecast.iloc[best_period_idx - 91//2]['ds']
    best_period_end = forecast.iloc[best_period_idx + 91//2]['ds']
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/nf8sdzrs.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/dogo5iah.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=76972', 'data', 'file=/tmp/tmpshbwn_60/nf8sdzrs.json', 'init=/tmp/tmpshbwn_60/dogo5iah.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_model4nrkhp6h/prophet_model-20240401010039.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
01:00:39 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:00:40 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/_swtq71u.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/z78r3_ko.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=50596', 'data', 'file=/tmp/tmpshbwn_60/_swtq71u.json', 'init=/tmp/tmpshbwn_60/z78r3_ko.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelr60qqzxx/prophet_model-20240401010040.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
01:00:40 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:00:41 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales and dollor sales are from december to march.

Lets evaluate the model performance metrics

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 23.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/nr4yktzz.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/ccyyebkr.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=99901', 'data', 'file=/tmp/tmpshbwn_60/nr4yktzz.json', 'init=/tmp/tmpshbwn_60/ccyyebkr.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelkczh018_/prophet_model-20240401010053.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
01:00:53 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:00:54 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:prophet:n_changepoints greater than number of observations. Using 23.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/vlqiw37n.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpshbwn_60/wxkb11wf.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=62535', 'data', 'file=/tmp/tmpshbwn_60/vlqiw37n.json', 'init=/tmp/tmpshbwn_60/wxkb11wf.json', 'output', 'file=/tmp/tmpshbwn_60/prophet_modelu2nv_qi4/prophet_model-20240401010054.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
01:00:54 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
01:00:55 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 3.382974095751682, MSE: 12.899271272796936
DOLLAR_SALES - MAE: 4.070474095751682, MSE: 18.703416615016657

The MAE and MSE values for unit sales are 3 and 12. For dollor sales the respected values are 4 and 18.

Results ¶

Since, the MAE values of the prophet model is very low, so the prophet model is good and From this model we can see that the best 13 weeks for unit sales are from june to august and dollor sales from april to june.

4. Innovative Product ¶

Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug
Caloric Segment: Regular
Market Category: ING Enhanced Water
Manufacturer: Swire-CC
Brand: Greetingle
Package Type: .5L 12One Jug
Flavor: ‘Woodsy Yellow’

Swire plans to release this product for 13 weeks, but only in one region.
Which region would it perform best in?

4.1 Demand forecasting on Package, Manufacturer, Category in the Northern region¶

We first filter package '.5L 12One Jug', category 'Ing Enhanced Water' with 'Swire-cc' manufacturer in North Regions

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Woodsy yellow'. So, we fist consider the other Package : '.5L 12one jug' Market Category: ing enhanced water, Manufacturer : Swire-CC.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_28d5a02f_18e978a1a5b') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
    AND fmd.MANUFACTURER = 'SWIRE-CC'
    AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY 
    fmd.DATE;
In [ ]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_28d5a02f_18e978a1a5b') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-02-06 5.0 12.56
1 2021-07-31 5.0 14.25
2 2021-08-07 2.0 5.68
3 2021-05-15 6.0 34.67
4 2021-07-03 6.0 14.26
5 2021-12-04 1.0 2.49
6 2021-10-09 3.0 28.86
7 2021-08-14 5.0 12.45
8 2021-07-10 4.0 9.96
9 2021-08-28 1.0 2.49
10 2021-10-02 1.0 2.49
11 2021-01-30 3.0 16.98
12 2021-05-01 5.0 11.47
13 2022-02-12 13.0 32.37
14 2022-02-26 1.0 2.49
15 2021-06-26 2.0 4.98
16 2021-12-25 10.0 25.20
17 2021-09-04 1.0 2.49
18 2021-05-08 2.0 4.00
19 2021-05-29 4.0 8.00
20 2021-02-13 11.0 25.25
21 2021-03-20 7.0 16.45
22 2023-06-10 1.0 23.88
23 2021-06-12 4.0 9.96
24 2021-03-27 2.0 4.00
25 2021-09-11 1.0 2.49
26 2022-01-08 4.0 11.16
27 2021-10-30 1.0 2.49
28 2021-03-13 2.0 4.00
29 2021-07-17 2.0 4.98
30 2021-04-24 5.0 10.49
31 2021-04-03 1.0 2.49
32 2021-01-16 1.0 20.00
33 2022-03-12 1.0 2.49
34 2021-06-05 6.0 12.98
35 2021-09-18 2.0 2.91
36 2021-02-27 3.0 7.47
37 2021-09-25 16.0 38.17
38 2021-03-06 17.0 34.39
39 2021-02-20 14.0 50.37
40 2021-04-10 4.0 8.98
41 2021-05-22 18.0 36.06
42 2023-07-08 1.0 23.88
43 2021-04-17 4.0 8.00
44 2021-01-23 1.0 24.00
45 2021-06-19 8.0 16.00
46 2022-06-25 15.0 40.35
47 2021-10-23 2.0 4.98

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-05-22 18.0 36.06 2021 5 20
1 2021-04-10 4.0 8.98 2021 4 14
2 2021-04-17 4.0 8.00 2021 4 15
3 2021-01-23 1.0 24.00 2021 1 3
4 2021-06-19 8.0 16.00 2021 6 24
5 2023-07-08 1.0 23.88 2023 7 27
6 2021-12-04 1.0 2.49 2021 12 48
7 2021-05-15 6.0 34.67 2021 5 19
8 2021-07-03 6.0 14.26 2021 7 26
9 2021-10-09 3.0 28.86 2021 10 40
10 2021-03-13 2.0 4.00 2021 3 10
11 2021-04-03 1.0 2.49 2021 4 13
12 2021-04-24 5.0 10.49 2021 4 16
13 2021-07-17 2.0 4.98 2021 7 28
14 2021-08-14 5.0 12.45 2021 8 32
15 2021-10-02 1.0 2.49 2021 10 39
16 2021-07-10 4.0 9.96 2021 7 27
17 2021-08-28 1.0 2.49 2021 8 34
18 2021-02-06 5.0 12.56 2021 2 5
19 2021-08-07 2.0 5.68 2021 8 31
20 2021-07-31 5.0 14.25 2021 7 30
21 2021-09-04 1.0 2.49 2021 9 35
22 2021-12-25 10.0 25.20 2021 12 51
23 2021-05-01 5.0 11.47 2021 5 17
24 2021-01-30 3.0 16.98 2021 1 4
25 2022-02-12 13.0 32.37 2022 2 6
26 2022-02-26 1.0 2.49 2022 2 8
27 2021-06-26 2.0 4.98 2021 6 25
28 2021-05-08 2.0 4.00 2021 5 18
29 2021-09-18 2.0 2.91 2021 9 37
30 2021-02-27 3.0 7.47 2021 2 8
31 2021-06-05 6.0 12.98 2021 6 22
32 2021-01-16 1.0 20.00 2021 1 2
33 2022-03-12 1.0 2.49 2022 3 10
34 2021-02-13 11.0 25.25 2021 2 6
35 2021-05-29 4.0 8.00 2021 5 21
36 2021-06-12 4.0 9.96 2021 6 23
37 2021-03-27 2.0 4.00 2021 3 12
38 2023-06-10 1.0 23.88 2023 6 23
39 2021-03-20 7.0 16.45 2021 3 11
40 2021-10-30 1.0 2.49 2021 10 43
41 2022-01-08 4.0 11.16 2022 1 1
42 2021-09-11 1.0 2.49 2021 9 36
43 2022-06-25 15.0 40.35 2022 6 25
44 2021-10-23 2.0 4.98 2021 10 42
45 2021-09-25 16.0 38.17 2021 9 38
46 2021-02-20 14.0 50.37 2021 2 7
47 2021-03-06 17.0 34.39 2021 3 9

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Prophet Time Series Model¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/0svx7mb1.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/_phittwi.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=31875', 'data', 'file=/tmp/tmpu6u1ud2o/0svx7mb1.json', 'init=/tmp/tmpu6u1ud2o/_phittwi.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelwmhj1n5r/prophet_model-20240331061732.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
06:17:32 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
06:17:33 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/phh7oqcn.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/_naalr95.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=71823', 'data', 'file=/tmp/tmpu6u1ud2o/phh7oqcn.json', 'init=/tmp/tmpu6u1ud2o/_naalr95.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelnezw0r1x/prophet_model-20240331061733.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
06:17:33 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
06:17:34 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales are from januay to march and dollor sales are from may to august.

Lets evaluate the model performance metrics.

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/9fr3okji.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/oo83c7fy.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=12612', 'data', 'file=/tmp/tmpu6u1ud2o/9fr3okji.json', 'init=/tmp/tmpu6u1ud2o/oo83c7fy.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model3kybzigo/prophet_model-20240331041229.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:12:29 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:12:29 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/zznb9jj_.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/ng9ybh30.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=70447', 'data', 'file=/tmp/tmpu6u1ud2o/zznb9jj_.json', 'init=/tmp/tmpu6u1ud2o/ng9ybh30.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelgdpy4abj/prophet_model-20240331041229.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:12:29 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:12:29 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828045
DOLLAR_SALES - MAE: 4927800.661033819, MSE: 24365389117751.88

The MAE and MSE values for unit sales are 84142 and 11464253863. For dollor sales the respected values are 4927800 and 24365389117751.

Results ¶

From this model evaluations we can say that these values are high because we only considered the Northern region which contains low sales that's why the data is not picking up the consistent pattern in the sales. But the forecast sales are also low in the Northern region in numbers around 5 units and dollar sales around 30 dollars.

4.2 Demand forecasting on Package, Manufacturer, Category in the Southern region¶

We then filter package '.5L 12One Jug', category 'Ing Enhanced Water' with 'Swire-cc' manufacturer in South Regions

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Woodsy yellow'. So, we fist consider the other Package : '.5L 12one jug' Market Category: ing enhanced water, Manufacturer : Swire-CC in south region

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_53d98272_18e9377e43f') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
    AND fmd.MANUFACTURER = 'SWIRE-CC'
    AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY 
    fmd.DATE;
In [ ]:
# This code will read results from your previous job
job = client.get_job('bquxjob_53d98272_18e9377e43f') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-02-13 13.0 29.93
1 2021-03-27 13.0 30.41
2 2021-09-11 7.0 17.73
3 2021-10-30 2.0 26.37
4 2021-06-12 12.0 29.98
... ... ... ...
113 2023-04-29 1.0 2.89
114 2022-12-03 3.0 8.77
115 2023-06-03 1.0 2.89
116 2023-02-11 3.0 8.67
117 2022-06-11 4.0 10.76

118 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-02-13 13.0 29.93 2021 2 6
1 2021-03-27 13.0 30.41 2021 3 12
2 2021-09-11 7.0 17.73 2021 9 36
3 2021-10-30 2.0 26.37 2021 10 43
4 2021-06-12 12.0 29.98 2021 6 23
... ... ... ... ... ... ...
113 2023-04-29 1.0 2.89 2023 4 17
114 2022-12-03 3.0 8.77 2022 12 48
115 2023-06-03 1.0 2.89 2023 6 22
116 2023-02-11 3.0 8.67 2023 2 6
117 2022-06-11 4.0 10.76 2022 6 23

118 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Prophet Time Series Model ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/9hl2aii_.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/4ru1jqhn.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=16511', 'data', 'file=/tmp/tmpu6u1ud2o/9hl2aii_.json', 'init=/tmp/tmpu6u1ud2o/4ru1jqhn.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modely9aefsr6/prophet_model-20240331074543.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:45:43 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:45:43 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/t1f4i86c.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/l98sphy3.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=26575', 'data', 'file=/tmp/tmpu6u1ud2o/t1f4i86c.json', 'init=/tmp/tmpu6u1ud2o/l98sphy3.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelx_zh0y7h/prophet_model-20240331074543.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:45:43 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:45:43 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales are from october to january and dollor sales are from may to july.

Lets evaluate the model performance metrics.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()  # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy()  # Make a copy to avoid modifying the original DataFrame

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

# Ensure that the DataFrame has the columns 'DATE' and 'UNIT_SALES' or 'DOLLAR_SALES'
# If you have different column names, replace 'UNIT_SALES' or 'DOLLAR_SALES' accordingly
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)  # for UNIT_SALES forecasting
# or
# train.rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'}, inplace=True)  # for DOLLAR_SALES forecasting

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/zxxuy1kw.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/kl51j3h0.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=52808', 'data', 'file=/tmp/tmpu6u1ud2o/zxxuy1kw.json', 'init=/tmp/tmpu6u1ud2o/kl51j3h0.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model96fwf2qa/prophet_model-20240331074545.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
07:45:45 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:45:46 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/fv92efm5.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/e2khrykp.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=7373', 'data', 'file=/tmp/tmpu6u1ud2o/fv92efm5.json', 'init=/tmp/tmpu6u1ud2o/e2khrykp.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelj44y2p3o/prophet_model-20240331074546.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
07:45:46 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:45:46 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 14.760235496932111, MSE: 243.83874523289515
DOLLAR_SALES - MAE: 19.94565216359878, MSE: 471.88964169847605

The MAE and MSE values for unit sales are 14 and 243 . For dollor sales the respected values are 19 and 471.

Results ¶

When compared to the Northern region, the Southern region has high sales and the MAE and MSE values are low which infers that the model is good. But, the sales in the Southern region is also decreasing over the years.

Since, the datapoints are not available and the right combination of package type is not available, we refer to the non - swire products to further analysis.

4.3 Demand forecasting on Category, Non-Manufacturer, and Package in North Region¶

We now filter package '.5L 12One Jug', category 'Ing Enhanced Water' with 'Non Swire-cc' manufacturer in North Regions

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Woodsy yellow'. So, we fist consider the other Package : '.5L 12one jug' Market Category: ing enhanced water, Manufacturer is not Swire-CC in north region

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_70b31c_18e937d73eb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
    AND fmd.MANUFACTURER != 'SWIRE-CC'
    AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY 
    fmd.DATE;
In [ ]:
# This code will read results from your previous job
job = client.get_job('bquxjob_70b31c_18e937d73eb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2022-04-23 11404.0 87848.49
1 2023-06-10 12910.0 116092.66
2 2021-05-29 9963.0 66150.83
3 2021-10-30 7882.0 53111.21
4 2023-04-08 11684.0 104108.96
... ... ... ...
143 2021-07-17 12477.0 83090.46
144 2021-04-24 8444.0 56115.15
145 2021-04-03 7517.0 49721.32
146 2022-01-01 6378.0 49240.57
147 2021-11-06 7643.0 51266.10

148 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd

# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2022-04-23 11404.0 87848.49 2022 4 16
1 2023-06-10 12910.0 116092.66 2023 6 23
2 2021-05-29 9963.0 66150.83 2021 5 21
3 2021-10-30 7882.0 53111.21 2021 10 43
4 2023-04-08 11684.0 104108.96 2023 4 14
... ... ... ... ... ... ...
143 2021-07-17 12477.0 83090.46 2021 7 28
144 2021-04-24 8444.0 56115.15 2021 4 16
145 2021-04-03 7517.0 49721.32 2021 4 13
146 2022-01-01 6378.0 49240.57 2022 1 52
147 2021-11-06 7643.0 51266.10 2021 11 44

148 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Prophet Time Series Model ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/htx7on99.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/uyby8f3e.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=53881', 'data', 'file=/tmp/tmpu6u1ud2o/htx7on99.json', 'init=/tmp/tmpu6u1ud2o/uyby8f3e.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model9ikd6hr0/prophet_model-20240331075352.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:53:52 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:53:52 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/exz6b0db.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/gpvt6fco.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=28164', 'data', 'file=/tmp/tmpu6u1ud2o/exz6b0db.json', 'init=/tmp/tmpu6u1ud2o/gpvt6fco.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model5y1cvch8/prophet_model-20240331075352.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:53:52 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:53:52 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales and dollor sales are from july to october.

Lets evaluate the model perdormance metrics.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()

# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/9fr3okji.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/oo83c7fy.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=12612', 'data', 'file=/tmp/tmpu6u1ud2o/9fr3okji.json', 'init=/tmp/tmpu6u1ud2o/oo83c7fy.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model3kybzigo/prophet_model-20240331041229.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:12:29 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:12:29 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/zznb9jj_.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/ng9ybh30.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=70447', 'data', 'file=/tmp/tmpu6u1ud2o/zznb9jj_.json', 'init=/tmp/tmpu6u1ud2o/ng9ybh30.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelgdpy4abj/prophet_model-20240331041229.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:12:29 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:12:29 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828045
DOLLAR_SALES - MAE: 4927800.661033819, MSE: 24365389117751.88

The MAE and MSE values for unit sales are 84142 and 11464253863.For Dollor sales the respected values are 4927800 and 24365389117751.

Exponential Smoothing ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window='182D').sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=13)
    return best_period_start, best_period_end

# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

From the plot we can see that the best 13 weeks for unit sales and dollor sales are from july to october.

In [ ]:
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2024-07-14 and end on 2024-10-06, with total sales: 190316.1355028812
Best 13 weeks for dollar sales start on 2024-07-14 and end on 2024-10-06, with total sales: 1682082.4189948225
Best 13 weeks for Unit Sales:
2024-07-14    14310.653500
2024-07-21    15523.281384
2024-07-28    17459.010526
2024-08-04    16674.040234
2024-08-11    15631.792997
2024-08-18    15306.547874
2024-08-25    15678.845612
2024-09-01    13125.795352
2024-09-08    13452.970459
2024-09-15    12497.527782
2024-09-22    13325.562461
2024-09-29    12780.805251
2024-10-06    14549.302071
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-07-14    127282.602900
2024-07-21    135063.109120
2024-07-28    149215.453788
2024-08-04    144120.410291
2024-08-11    137306.625192
2024-08-18    133775.094500
2024-08-25    135460.550054
2024-09-01    116160.568178
2024-09-08    121629.692055
2024-09-15    114710.283040
2024-09-22    120714.608959
2024-09-29    116253.996762
2024-10-06    130389.424156
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')

The total unit sales of these products in these 13 weeks are 190316 and the dollar sales are 1682082.

In [ ]:
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4711.606597501726, MSE: 26676303.715555746
DOLLAR_SALES - MAE: 32546.860312665805, MSE: 1267153502.8132436

The MAE and MSE values for unit sales are 4711 and 26676303. For dollor sales the respected values are 32546 and 1267153502.

The MAE values are decreased when compared to other models.So this is a quite good model.

Results ¶

From the exponential smoothing model, the non- swire cc manufactures who sell products consists of the package '.5L 12One Jug'and category 'Ing Enhanced Water' in Northern region the best 13 weeks are July to October with sales 190316 and the dollar sales are 1682082.

4.4 Demand forecasting on Category, Non-Manufacturer, and Package in Southern Region¶

We now filter package '.5L 12One Jug', category 'Ing Enhanced Water' with 'Non Swire-cc' manufacturer in South Regions

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Woodsy yellow'. So, we fist consider the other Package : '.5L 12one jug' Market Category: ing enhanced water, Manufacturer is not Swire-CC in south region

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_3930fb1e_18e9383de9e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT 
    fmd.DATE,
    SUM(fmd.UNIT_SALES) AS UNIT_SALES, 
    SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
    AND fmd.MANUFACTURER != 'SWIRE-CC'
    AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY 
    fmd.DATE;
In [ ]:
# This code will read results from your previous job
job = client.get_job('bquxjob_3930fb1e_18e9383de9e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2022-11-12 33680.0 269639.90
1 2022-05-14 39223.0 301716.69
2 2023-07-01 35752.0 308037.89
3 2021-11-13 24727.0 162735.79
4 2021-09-25 33956.0 222055.26
... ... ... ...
143 2022-04-30 35650.0 275004.30
144 2022-10-01 35831.0 287711.35
145 2022-05-28 36645.0 274302.38
146 2021-12-11 26730.0 194410.59
147 2021-08-07 33884.0 222792.13

148 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2022-11-12 33680.0 269639.90 2022 11 45
1 2022-05-14 39223.0 301716.69 2022 5 19
2 2023-07-01 35752.0 308037.89 2023 7 26
3 2021-11-13 24727.0 162735.79 2021 11 45
4 2021-09-25 33956.0 222055.26 2021 9 38
... ... ... ... ... ... ...
143 2022-04-30 35650.0 275004.30 2022 4 17
144 2022-10-01 35831.0 287711.35 2022 10 39
145 2022-05-28 36645.0 274302.38 2022 5 21
146 2021-12-11 26730.0 194410.59 2021 12 49
147 2021-08-07 33884.0 222792.13 2021 8 31

148 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Prophet Time Series Model ¶

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

In [ ]:
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)

# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)

# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)

# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()

# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date]
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_start = forecast_future.loc[best_period_idx]['ds']
    best_period_end = forecast_future.loc[best_period_idx]['ds'] + pd.DateOffset(weeks=12) # Include the end week
    return best_period_start, best_period_end

# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)

# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/ny1ym660.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/09hcjubg.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=99743', 'data', 'file=/tmp/tmpu6u1ud2o/ny1ym660.json', 'init=/tmp/tmpu6u1ud2o/09hcjubg.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelew45gzis/prophet_model-20240331075833.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:58:33 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:58:33 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/dl292s72.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/xi_iqncb.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=72599', 'data', 'file=/tmp/tmpu6u1ud2o/dl292s72.json', 'init=/tmp/tmpu6u1ud2o/xi_iqncb.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model3h8lsevb/prophet_model-20240331075833.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:58:34 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:58:34 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing

From this plot we can see that the best 13 weeks for unit sales and dollor sales are from july to october.

Lets evaluate the model performance metrics.

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()

# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)

train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])

# Repeat the process for DOLLAR_SALES
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train[['ds', 'y']])  # Ensure 'ds' and 'y' columns are selected
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/yylp_0gg.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/to2wqdnk.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=70757', 'data', 'file=/tmp/tmpu6u1ud2o/yylp_0gg.json', 'init=/tmp/tmpu6u1ud2o/to2wqdnk.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_modelp7yalqgz/prophet_model-20240331075836.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:58:36 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:58:36 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/vyemm4sj.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu6u1ud2o/m9janl1s.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=23826', 'data', 'file=/tmp/tmpu6u1ud2o/vyemm4sj.json', 'init=/tmp/tmpu6u1ud2o/m9janl1s.json', 'output', 'file=/tmp/tmpu6u1ud2o/prophet_model8yxnb6q_/prophet_model-20240331075836.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
07:58:36 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
07:58:37 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4741.745698006632, MSE: 30453680.45680512
DOLLAR_SALES - MAE: 244077.61953977996, MSE: 60052964894.22412

The MAE and MSE values for unit sales are 4741 and 30453680 . For dollor sales the respected values are 244077 and 60052964894.

Exponential Smoothing ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame
last_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window='182D').sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=13)
    return best_period_start, best_period_end

# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)

# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)

# Plotting function with adjustment for negative values
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))

    # Ensure no negative values in the forecast
    forecast_positive = forecast.clip(lower=0)

    plt.plot(forecast_positive.index, forecast_positive, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')

From the plot we can see that the best 13 weeks for unit sales and dollor sales are from july to october.

In [ ]:
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)

# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'

# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)

print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2024-07-14 and end on 2024-10-06, with total sales: 464594.4540587605
Best 13 weeks for dollar sales start on 2024-07-14 and end on 2024-10-06, with total sales: 4258204.564067395
Best 13 weeks for Unit Sales:
2024-07-14    42443.154399
2024-07-21    37839.956020
2024-07-28    32907.331064
2024-08-04    33482.475829
2024-08-11    32264.235176
2024-08-18    34061.186527
2024-08-25    34373.593740
2024-09-01    34129.642549
2024-09-08    35262.415109
2024-09-15    33552.560289
2024-09-22    36232.678257
2024-09-29    39074.351117
2024-10-06    38970.873982
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-07-14    369900.668096
2024-07-21    340664.648405
2024-07-28    309588.930179
2024-08-04    313950.912343
2024-08-11    306419.112519
2024-08-18    316457.858622
2024-08-25    318355.324297
2024-09-01    317236.414892
2024-09-08    324651.213526
2024-09-15    312820.160369
2024-09-22    330256.981539
2024-09-29    349097.149445
2024-10-06    348805.189837
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')

The total unit sales of these products in these 13 weeks are 464594 and the dollar sales are 4258204.

Lets evaluate the model performance metrics.

In [ ]:
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 13323.255430846973, MSE: 214736828.1026801
DOLLAR_SALES - MAE: 87849.25315413318, MSE: 9191142635.488482

The MAE and MSE values for unit sales are 13323 and 214736828. For dollor sales the respected values are 87849 and 9191142635.

Results ¶

From the exponential smoothing model, the non- swire cc manufactures who sell products consists of the package '.5L 12One Jug'and category 'Ing Enhanced Water' in Southern region the best 13 weeks are July to October with sales 464591 and the dollar sales are 4258204 which is quite high when compared to Northern region. For non- swire cc products of the package '.5L 12One Jug'and category 'Ing Enhanced Water' the best 13 weeks are July to October in the Southern Region.

5. Innovative Product ¶

Item Description: Diet Energy Moonlit Casava 2L Multi Jug
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Diet Moonlit
Package Type: 2L Multi Jug
Flavor: ‘Cassava’

Swire plans to release this product for 13 weeks, but only in one region.
Which region would it perform best in?

5.1 Demand forecasting on Category, Manufacturer and Caloric Segment ¶

We first applied filter to Category 'Energy' with 'Swire-CC', having caloric segment as 'Diet/Light'

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of Flavor: 'Cassava'. So, we fist consider the other caloric segment : Diet/light Market Category: energy, Manufacturer : Swire-CC.

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_457a6c38_18e96f5f940') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE CATEGORY = 'ENERGY'
AND MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_457a6c38_18e96f5f940') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-01-09 4593.0 4303.21
1 2021-10-02 3856.0 3522.99
2 2021-08-28 4322.0 4004.36
3 2023-07-15 2036.0 2123.17
4 2021-08-14 4615.0 4211.04
... ... ... ...
134 2023-04-22 1925.0 2090.46
135 2023-07-22 2009.0 2135.09
136 2022-05-14 2841.0 3044.46
137 2022-02-05 3201.0 2870.83
138 2021-11-13 3704.0 3346.70

139 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-01-09 4593.0 4303.21 2021 1 1
1 2021-10-02 3856.0 3522.99 2021 10 39
2 2021-08-28 4322.0 4004.36 2021 8 34
3 2023-07-15 2036.0 2123.17 2023 7 28
4 2021-08-14 4615.0 4211.04 2021 8 32
... ... ... ... ... ... ...
134 2023-04-22 1925.0 2090.46 2023 4 16
135 2023-07-22 2009.0 2135.09 2023 7 29
136 2022-05-14 2841.0 3044.46 2022 5 19
137 2022-02-05 3201.0 2870.83 2022 2 5
138 2021-11-13 3704.0 3346.70 2021 11 45

139 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Exponential Smoothing Modeling ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)  # 26 weeks include the end week
    return best_period_start, best_period_end

# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)

# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)

# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(

From the plot we can see that the best 26 weeks for unit sales and dollor sales are from november to april.

In [ ]:
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)

# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)

print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2023-11-05 and end on 2024-04-28, with total sales: 40468.83476378765
Best 26 weeks for dollar sales start on 2023-11-05 and end on 2024-04-28, with total sales: 46427.21591674558
Best 26 weeks for Unit Sales:
2023-11-05    2171.170239
2023-11-12    2321.668503
2023-11-19    2193.372092
2023-11-26    2047.140098
2023-12-03    2296.533245
2023-12-10    2369.246592
2023-12-17    2049.663623
2023-12-24    1974.224296
2023-12-31    1790.268407
2024-01-07    2193.979804
2024-01-14    1956.610374
2024-01-21    1502.508981
2024-01-28    1321.018341
2024-02-04    1291.585614
2024-02-11    1362.249492
2024-02-18    1238.811750
2024-02-25     962.418576
2024-03-03    1076.092944
2024-03-10    1343.447003
2024-03-17    1220.509414
2024-03-24    1102.451545
2024-03-31    1048.203694
2024-04-07     991.819266
2024-04-14    1138.972105
2024-04-21     797.561897
2024-04-28     707.306868
Freq: W-SUN, dtype: float64

Best 26 weeks for Dollar Sales:
2023-11-05    2304.329268
2023-11-12    2380.930372
2023-11-19    2293.462332
2023-11-26    2163.956281
2023-12-03    2447.167730
2023-12-10    2493.962990
2023-12-17    2217.521035
2023-12-24    2158.136001
2023-12-31    1956.502936
2024-01-07    2419.773739
2024-01-14    2115.183791
2024-01-21    1631.926909
2024-01-28    1491.065328
2024-02-04    1537.274507
2024-02-11    1628.122270
2024-02-18    1524.160686
2024-02-25    1252.504368
2024-03-03    1376.407396
2024-03-10    1656.640720
2024-03-17    1487.357865
2024-03-24    1508.118727
2024-03-31    1443.083223
2024-04-07    1372.871905
2024-04-14    1461.203071
2024-04-21    1104.389924
2024-04-28    1001.162543
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 26-week period:
Index(['November', 'December', 'January', 'February', 'March', 'April'], dtype='object')

Best months for Dollar Sales within the 26-week period:
Index(['November', 'December', 'January', 'February', 'March', 'April'], dtype='object')

The total unit sales of these products in these 26 weeks are 40468 and the dollar sales are 46427.

Lets evaluate the model performance metrics

In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
UNIT_SALES - MAE: 239.80472298824847, MSE: 88183.82539630556
DOLLAR_SALES - MAE: 417.2807652676282, MSE: 262579.3380016585

The MAE and MSE values for unit sales are 239 and 88183. For dollor sales the respected values are 417 and 262579.

Results ¶

From the model, we can say the best 6 months sales from November to April with the The total unit sales are 40468 and the dollar sales are 46427.

Since there is no flavour category in this combination,we use non swire-cc data to model the flavour category.

5.2 Demand forecasting on Flavor, Non-Manufacturer, Caloric Segment ¶

We first applied filter to Flavor 'Casava' with 'Non Swire-CC', having caloric segment as 'Diet/Light'

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

In the dataset provided to us we don't have dataset with combinations of package . So, we fist consider the other caloric segment : Diet/light Manufacturer : Swire-CC. Flavor: 'Cassava'

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_64491685_18e97366214') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE ITEM LIKE '%CASAVA%'
AND MANUFACTURER != 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_64491685_18e97366214') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-04-03 6610.0 24789.77
1 2021-07-17 24918.0 87976.93
2 2023-05-06 43457.0 177678.72
3 2021-04-24 6645.0 24910.45
4 2021-03-13 6783.0 25277.23
... ... ... ...
143 2022-06-25 50474.0 180745.74
144 2022-06-11 46432.0 166148.73
145 2023-03-11 43351.0 176258.03
146 2021-10-23 13933.0 50896.24
147 2023-02-11 45757.0 175155.80

148 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-04-03 6610.0 24789.77 2021 4 13
1 2021-07-17 24918.0 87976.93 2021 7 28
2 2023-05-06 43457.0 177678.72 2023 5 18
3 2021-04-24 6645.0 24910.45 2021 4 16
4 2021-03-13 6783.0 25277.23 2021 3 10
... ... ... ... ... ... ...
143 2022-06-25 50474.0 180745.74 2022 6 25
144 2022-06-11 46432.0 166148.73 2022 6 23
145 2023-03-11 43351.0 176258.03 2023 3 10
146 2021-10-23 13933.0 50896.24 2021 10 42
147 2023-02-11 45757.0 175155.80 2023 2 6

148 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Exponential Smoothing Modeling ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)  # 26 weeks include the end week
    return best_period_start, best_period_end

# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)

# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)

# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(

From the plot we can see that the best 26 weeks for unit sales and dollor sales are from April to october.

In [ ]:
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)

# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)

print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2024-04-21 and end on 2024-10-13, with total sales: 1227793.7380098142
Best 26 weeks for dollar sales start on 2024-04-21 and end on 2024-10-13, with total sales: 4914402.822625622
Best 26 weeks for Unit Sales:
2024-04-21    40418.033021
2024-04-28    42740.520955
2024-05-05    44012.896388
2024-05-12    44181.204299
2024-05-19    44054.482594
2024-05-26    44764.104845
2024-06-02    46214.687403
2024-06-09    50902.053672
2024-06-16    53630.811032
2024-06-23    54461.491084
2024-06-30    46443.993389
2024-07-07    40978.131545
2024-07-14    40902.448850
2024-07-21    41627.635865
2024-07-28    47028.436022
2024-08-04    48826.511522
2024-08-11    51455.305162
2024-08-18    54435.810559
2024-08-25    53226.901432
2024-09-01    52638.750902
2024-09-08    53115.994937
2024-09-15    49865.496388
2024-09-22    47035.393628
2024-09-29    47466.708458
2024-10-06    44586.997127
2024-10-13    42778.936930
Freq: W-SUN, dtype: float64

Best 26 weeks for Dollar Sales:
2024-04-21    171111.403175
2024-04-28    180729.921899
2024-05-05    186967.150778
2024-05-12    186669.077060
2024-05-19    184906.952382
2024-05-26    186274.268576
2024-06-02    189545.580864
2024-06-09    202315.589330
2024-06-16    207866.724115
2024-06-23    209819.995727
2024-06-30    178516.050156
2024-07-07    164623.789983
2024-07-14    164515.663724
2024-07-21    169518.801625
2024-07-28    187799.778810
2024-08-04    193557.738142
2024-08-11    202432.954340
2024-08-18    212103.515913
2024-08-25    206851.792013
2024-09-01    205743.429550
2024-09-08    206073.407932
2024-09-15    195979.133752
2024-09-22    186706.393930
2024-09-29    185569.311947
2024-10-06    176809.675542
2024-10-13    171394.721363
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 26-week period:
Index(['April', 'May', 'June', 'July', 'August', 'September', 'October'], dtype='object')

Best months for Dollar Sales within the 26-week period:
Index(['April', 'May', 'June', 'July', 'August', 'September', 'October'], dtype='object')

The total unit sales of these products in these 26 weeks are 1227793 and the dollar sales are 4914402.

Lets evaluate model performance metrics

In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 10314.963439861342, MSE: 133439789.47343811
DOLLAR_SALES - MAE: 38728.14920459904, MSE: 1828464035.8834724
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(

The MAE and MSE values for unit sales are 10314 and 133439789. For dollor sales the respected values are 38728 and 1828464035.

Results ¶

From the non-swire cc and flavour: cassava model, we can say the best 6 months sales from April to October with the unit sales of 1227793 and the dollar sales are 4914402.

In the next model we use package combination to explore the sales with package:2L multijug.

5.3 Demand forecasting based on Package, Manufacturer, Caloric Segment and Brand ¶

We first applied filter to Package '2L Multi Jug' with 'Swire-CC', having caloric segment as 'Diet/Light' and brand 'Diet Moonlit'

Data Preparation ¶

Before building the model to forecast the sales, let's check the data and find the similar products in the given dataset. Since, the dataset is large we use Google Big query to filter the data based on the requirements and import those filtered data to this notebook.

we fist consider the these specifications caloric segment : Diet/light Manufacturer : Swire-CC. package : 2L multijug brand : diet moonlit

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_7af2f6fe_18e973a652e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE PACKAGE = '2L MULTI JUG'
AND MANUFACTURER = 'SWIRE-CC'
AND CALORIC_SEGMENT = 'DIET/LIGHT'
AND BRAND = 'DIET MOONLIT'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_7af2f6fe_18e973a652e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-02-13 20212.0 30147.19
1 2022-08-20 16085.0 25081.16
2 2021-09-11 14129.0 21345.83
3 2022-04-23 18314.0 28393.05
4 2022-01-08 14753.0 21325.79
... ... ... ...
142 2021-08-28 16980.0 22703.26
143 2021-01-09 20439.0 26439.57
144 2021-10-02 16770.0 25637.67
145 2023-08-05 20631.0 33975.46
146 2022-09-10 17558.0 30238.48

147 rows × 3 columns

We get the data from the google big query to this notebook and next we make changes to the dataframe 'result' by changing the 'DATE' format and dividing it into year, month and week.

In [ ]:
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-02-13 20212.0 30147.19 2021 2 6
1 2022-08-20 16085.0 25081.16 2022 8 33
2 2021-09-11 14129.0 21345.83 2021 9 36
3 2022-04-23 18314.0 28393.05 2022 4 16
4 2022-01-08 14753.0 21325.79 2022 1 1
... ... ... ... ... ... ...
142 2021-08-28 16980.0 22703.26 2021 8 34
143 2021-01-09 20439.0 26439.57 2021 1 1
144 2021-10-02 16770.0 25637.67 2021 10 39
145 2023-08-05 20631.0 33975.46 2023 8 31
146 2022-09-10 17558.0 30238.48 2022 9 36

147 rows × 6 columns

We follow the same pattern for the import of the dataset from the google big query to this notebook and exporting year and month and week from the 'DATE' column throughout the notebook

Exponential Smoothing Modeling ¶

Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)

# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()

# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]

# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index

# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)  # 26 weeks include the end week
    return best_period_start, best_period_end

# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)

# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)

# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()

# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(

From the plot we can see that the best 26 weeks for unit sales are from november to may and dollor sales are from december to june.

In [ ]:
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
    rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=25)
    return best_period_start, best_period_end, rolling_sum.max()

# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)

# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)

# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")

# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]

# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)

print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)

# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()

print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)

print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2023-11-26 and end on 2024-05-19, with total sales: 427813.3789549731
Best 26 weeks for dollar sales start on 2023-12-17 and end on 2024-06-09, with total sales: 838114.9938714138
Best 26 weeks for Unit Sales:
2023-11-26    18273.557368
2023-12-03    17073.675711
2023-12-10    16790.699130
2023-12-17    17610.647375
2023-12-24    17945.467759
2023-12-31    19151.167046
2024-01-07    17359.638204
2024-01-14    17887.378321
2024-01-21    17425.926358
2024-01-28    18775.194943
2024-02-04    17264.305144
2024-02-11    15921.942614
2024-02-18    13670.293916
2024-02-25    13384.690415
2024-03-03    13376.592792
2024-03-10    13738.471327
2024-03-17    14194.946141
2024-03-24    14273.816455
2024-03-31    15187.829951
2024-04-07    17385.142863
2024-04-14    16268.308741
2024-04-21    16654.632591
2024-04-28    16482.525034
2024-05-05    17690.652919
2024-05-12    16515.122418
2024-05-19    17510.753420
Freq: W-SUN, dtype: float64

Best 26 weeks for Dollar Sales:
2023-12-17    32796.052595
2023-12-24    32715.376497
2023-12-31    34628.901927
2024-01-07    31799.171033
2024-01-14    32266.722821
2024-01-21    31948.042390
2024-01-28    33843.241264
2024-02-04    31938.606478
2024-02-11    30206.685693
2024-02-18    28399.765983
2024-02-25    27729.054990
2024-03-03    27690.232296
2024-03-10    28292.725203
2024-03-17    29027.944043
2024-03-24    29664.709153
2024-03-31    30794.732258
2024-04-07    35226.660164
2024-04-14    34599.946032
2024-04-21    33930.319733
2024-04-28    34028.651332
2024-05-05    35303.518607
2024-05-12    34312.141535
2024-05-19    34744.548158
2024-05-26    34202.556449
2024-06-02    34796.812176
2024-06-09    33227.875063
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 26-week period:
Index(['November', 'December', 'January', 'February', 'March', 'April', 'May'], dtype='object')

Best months for Dollar Sales within the 26-week period:
Index(['December', 'January', 'February', 'March', 'April', 'May', 'June'], dtype='object')

The total unit sales of these products in these 26 weeks are 427813 and the dollar sales are 838114.

Lets evaluate model performance metrics

In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}

# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()

# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)

# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 1926.6166084072836, MSE: 5642614.361837015
DOLLAR_SALES - MAE: 2929.877972609832, MSE: 10958959.979422066
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(

The MAE and MSE values for unit sales are 1926 and 5642614. For dollor sales the respected values are 2929 and 10958959.

Results ¶

From the package: 2L multijug model, we can say the best 6 months salesthe best 26 weeks for unit sales are 427813 from november to may and the dollar sales are 838114 from december to june.

6. Innovative Product ¶

Item Description: Diet Square Mulberries Sparkling Water 10Small MLT
Caloric Segment: Diet
Market Category: Sparkling Water
Manufacturer: Swire-CC
Brand: Square
Package Type: 10Small MLT
Flavor: ‘Mulberries'

Swire plans to release this product for the duration of 1 year but only in the Northern region.
What will the forecasted demand be, in weeks, for this product?

6.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand¶

We first filter caloric segment of 'Diet', category 'Sparkling Water', brand 'Square' with 'Swire CC'

Data Preparation ¶

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_18d54420_18e977a5cdb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'DIET/LIGHT'
    AND fmd.CATEGORY = 'SPARKLING WATER'
    AND fmd.MANUFACTURER = 'SWIRE-CC'
    AND fmd.BRAND = 'SQUARE'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_18d54420_18e977a5cdb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2023-06-10 4.0 13.56
1 2023-04-08 3.0 9.37
2 2023-02-25 4.0 10.96
3 2023-09-30 46.0 73.95
4 2023-07-15 2.0 6.78
5 2023-09-16 48.0 69.57
6 2023-07-29 3.0 10.17
7 2023-10-07 36.0 58.14
8 2023-10-14 94.0 131.73
9 2023-05-27 4.0 12.36
10 2023-03-25 6.0 18.74
11 2023-09-02 7.0 21.33
12 2023-05-06 9.0 28.51
13 2023-01-28 4.0 11.96
14 2023-04-22 7.0 22.53
15 2023-07-22 3.0 10.17
16 2023-07-01 1.0 3.39
17 2023-06-24 3.0 10.17
18 2023-04-01 6.0 20.34
19 2023-05-13 7.0 22.53
20 2023-06-17 2.0 6.78
21 2023-05-20 3.0 8.97
22 2023-01-21 1.0 3.29
23 2023-09-09 4.0 11.96
24 2023-04-15 3.0 9.77
25 2023-03-18 4.0 11.76
26 2023-03-04 12.0 35.88
27 2023-06-03 1.0 3.39
28 2023-04-29 3.0 10.17
29 2023-09-23 53.0 86.52
30 2023-10-28 83.0 138.98
31 2023-03-11 4.0 11.76
32 2023-02-18 3.0 8.97
33 2023-02-04 3.0 8.97
34 2023-02-11 3.0 8.97
35 2023-10-21 85.0 133.70
In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2023-06-10 4.0 13.56 2023 6 23
1 2023-04-08 3.0 9.37 2023 4 14
2 2023-02-25 4.0 10.96 2023 2 8
3 2023-09-30 46.0 73.95 2023 9 39
4 2023-07-15 2.0 6.78 2023 7 28
5 2023-09-16 48.0 69.57 2023 9 37
6 2023-07-29 3.0 10.17 2023 7 30
7 2023-10-07 36.0 58.14 2023 10 40
8 2023-10-14 94.0 131.73 2023 10 41
9 2023-05-27 4.0 12.36 2023 5 21
10 2023-03-25 6.0 18.74 2023 3 12
11 2023-09-02 7.0 21.33 2023 9 35
12 2023-05-06 9.0 28.51 2023 5 18
13 2023-01-28 4.0 11.96 2023 1 4
14 2023-04-22 7.0 22.53 2023 4 16
15 2023-07-22 3.0 10.17 2023 7 29
16 2023-07-01 1.0 3.39 2023 7 26
17 2023-06-24 3.0 10.17 2023 6 25
18 2023-04-01 6.0 20.34 2023 4 13
19 2023-05-13 7.0 22.53 2023 5 19
20 2023-06-17 2.0 6.78 2023 6 24
21 2023-05-20 3.0 8.97 2023 5 20
22 2023-01-21 1.0 3.29 2023 1 3
23 2023-09-09 4.0 11.96 2023 9 36
24 2023-04-15 3.0 9.77 2023 4 15
25 2023-03-18 4.0 11.76 2023 3 11
26 2023-03-04 12.0 35.88 2023 3 9
27 2023-06-03 1.0 3.39 2023 6 22
28 2023-04-29 3.0 10.17 2023 4 17
29 2023-09-23 53.0 86.52 2023 9 38
30 2023-10-28 83.0 138.98 2023 10 43
31 2023-03-11 4.0 11.76 2023 3 10
32 2023-02-18 3.0 8.97 2023 2 7
33 2023-02-04 3.0 8.97 2023 2 5
34 2023-02-11 3.0 8.97 2023 2 6
35 2023-10-21 85.0 133.70 2023 10 42

Prophet Timeseries Modeling ¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from prophet import Prophet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Converting 'DATE' column to datetime format and setting it as index
results['DATE'] = pd.to_datetime(results['DATE'])
results.set_index('DATE', inplace=True)

# Forecasting for the next 52 weeks (1 year)
forecast_period = 52

# Prophet Model
df_prophet = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
prophet_model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
prophet_model.fit(df_prophet)
future = prophet_model.make_future_dataframe(periods=52, freq='W')
prophet_forecast = prophet_model.predict(future)['yhat'].tail(52)
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/k1q59pb3.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/z86_nkut.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=77920', 'data', 'file=/tmp/tmp7x5uyqfk/k1q59pb3.json', 'init=/tmp/tmp7x5uyqfk/z86_nkut.json', 'output', 'file=/tmp/tmp7x5uyqfk/prophet_model2rs8l8f7/prophet_model-20240401024313.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
02:43:13 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
02:43:14 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [ ]:
# Visualizing the forecasts
plt.figure(figsize=(15, 7))
plt.plot(future['ds'].tail(52), prophet_forecast, label='Prophet Forecast')
plt.legend()
plt.title('Prophet Forecasting')
plt.show()

Since there is no enough data it is hard to evaluate MAE and MSE scores with this data. Let's try with other combination

Results ¶

There is a peak sale in december 2023 with the prophet forecast and then it is rapidly falling down. It means the sales are higher during 'Christmas' month and then dropping later in the year.

6.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category¶

We now filter on caloric segment 'Diet', category 'Sparkling Water', flavor 'Mulberries' with 'Non Swire-CC'

Data Preparation ¶

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_68ccdfbc_18e97945b74') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
    SELECT DISTINCT zm.MARKET_KEY
    FROM `swirecc.zip_to_market_unit_mapping` zm
    LEFT JOIN `swirecc.consumer_demographics` cd
    ON cd.Zip = zm.ZIP_CODE
    WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'DIET/LIGHT'
    AND fmd.MANUFACTURER != 'SWIRE-CC'
    AND ITEM LIKE '%MULBERRIES%'
    AND fmd.CATEGORY = 'SPARKLING WATER'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_68ccdfbc_18e97945b74') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-07-10 28246.0 84693.04
1 2021-01-09 25132.0 79652.82
2 2022-04-02 22797.0 74193.46
3 2021-10-02 23514.0 70526.81
4 2022-09-24 26249.0 87931.94
... ... ... ...
143 2022-11-12 19952.0 70432.09
144 2022-05-14 26992.0 86113.05
145 2021-02-20 23433.0 72967.56
146 2022-05-07 24156.0 78188.48
147 2022-10-22 21027.0 70427.62

148 rows × 3 columns

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-07-10 28246.0 84693.04 2021 7 27
1 2021-01-09 25132.0 79652.82 2021 1 1
2 2022-04-02 22797.0 74193.46 2022 4 13
3 2021-10-02 23514.0 70526.81 2021 10 39
4 2022-09-24 26249.0 87931.94 2022 9 38
... ... ... ... ... ... ...
143 2022-11-12 19952.0 70432.09 2022 11 45
144 2022-05-14 26992.0 86113.05 2022 5 19
145 2021-02-20 23433.0 72967.56 2021 2 7
146 2022-05-07 24156.0 78188.48 2022 5 18
147 2022-10-22 21027.0 70427.62 2022 10 42

148 rows × 6 columns

Exponential Smoothing Modeling ¶

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Converting 'DATE' to datetime format if necessary and sort
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)

forecast_features.set_index('DATE', inplace=True)

# Define the model
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()

# Forecast 52 periods into the future
exp_forecast = exp_model.forecast(52)

# Create a new DateTimeIndex for the forecast
last_date = forecast_features.index[-1]
forecast_index = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')

# Assign the new index to the forecast series
exp_forecast.index = forecast_index

# Plot the historical and forecasted data
plt.figure(figsize=(14, 7))
plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Historical UNIT_SALES', color='blue')
plt.plot(exp_forecast.index, exp_forecast, label='Forecasted UNIT_SALES', linestyle='--', color='orange')
plt.title('1-Year Forecast for UNIT_SALES')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
In [ ]:
# Defining the Exponential Smoothing model for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()

# Forecast 52 periods into the future
exp_forecast_dollar = exp_model_dollar.forecast(52)

# The forecast index will be the same as for UNIT_SALES
exp_forecast_dollar.index = forecast_index

# Plotting the forecast for DOLLAR_SALES
plt.figure(figsize=(14, 7))
plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Historical DOLLAR_SALES', color='blue')
plt.plot(exp_forecast_dollar.index, exp_forecast_dollar, label='Forecasted DOLLAR_SALES', linestyle='--', color='orange')
plt.title('1-Year Forecast for DOLLAR_SALES')
plt.xlabel('Date')
plt.ylabel('DOLLAR_SALES')
plt.legend()
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split the dataset
split_point = int(len(forecast_features) * 0.8)
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]

# Fit the model on the training set
exp_model_dollar_train = ExponentialSmoothing(
    train['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()

# Forecast on the test set period
dollar_sales_forecast = exp_model_dollar_train.forecast(len(test))

# Calculate MAE and MSE using the actual and forecasted values
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)

# Print out the metrics
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency W-SAT will be used.
  self._init_dates(dates, freq)
DOLLAR_SALES - MAE: 12590.832913584723, MSE: 217388758.4560217
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(

The MAE of the model is 12590.82 and MSE of the model is 217388758.

In [ ]:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

forecast_features.set_index('DATE', inplace=True)

# Fill in zeros for the sake of the example, replace with actual values in your data
forecast_features['UNIT_SALES'] = forecast_features['UNIT_SALES'].replace(0, method='ffill')
forecast_features['DOLLAR_SALES'] = forecast_features['DOLLAR_SALES'].replace(0, method='ffill')

# Exponential Smoothing Forecast for UNIT_SALES
exp_model_unit = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()

# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()

# Generating forecasts for the next 52 weeks
exp_forecast_unit = exp_model_unit.forecast(52)
exp_forecast_dollar = exp_model_dollar.forecast(52)

# Combine the forecasts into one DataFrame
forecast_df = pd.concat([exp_forecast_unit, exp_forecast_dollar], axis=1)
forecast_df.columns = ['UNIT_SALES_FORECAST', 'DOLLAR_SALES_FORECAST']

forecast_df.head(10)  # Displaying the first 10 forecasted values
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it is not monotonic and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: A date index has been provided, but it is not monotonic and so will be ignored when e.g. forecasting.
  self._init_dates(dates, freq)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/holtwinters/model.py:917: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/base/tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
  return get_prediction_index(
Out[ ]:
UNIT_SALES_FORECAST DOLLAR_SALES_FORECAST
148 27571.529482 88626.485788
149 22516.259348 72451.544244
150 19161.324780 72114.641730
151 23268.457092 86416.510320
152 21455.170171 74671.297507
153 18437.162185 67710.984217
154 22050.273942 71270.784056
155 21622.112713 83212.977492
156 19837.533935 68923.164014
157 27972.108054 93571.239394

Based on this index, we got the week following last week in the dataframe with Unit sales of quantity 27571 for Non-Swire products and a dollar sales of $88626 in that week.

Results ¶

From the plot it can be inferred that in august month the product has more unit and dollar sales. Since the competitor has more sales during that month it is advised to have more production during that month.

7. Innovative Product ¶

Caloric Segment: Regular
Market Category: SSD
Manufacturer: Swire-CC
Brand: Sparkling Jacceptabletlester
Package Type: 11Small MLT
Flavor: ‘Avocado’

Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter.
What will the forecasted demand be, in weeks, for this product?

7.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand¶

We first filter caloric segment of 'Regular', category 'SSD', brand 'Sparkling Jacceptabletlester' with 'Swire CC'

Data Preparation ¶

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_c7a146e_18e97c1a8fd') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE CALORIC_SEGMENT = 'REGULAR'
AND MANUFACTURER = 'SWIRE-CC'
AND CATEGORY = 'SSD'
AND BRAND = 'SPARKLING JACCEPTABLETLESTER'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_c7a146e_18e97c1a8fd') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2021-01-23 70354.0 176293.12
1 2022-01-15 58533.0 149056.54
2 2022-01-22 56582.0 140421.99
3 2021-04-10 74685.0 188378.25
4 2021-06-19 82080.0 199918.61
... ... ... ...
142 2021-09-25 65465.0 166427.52
143 2023-07-01 55432.0 166191.44
144 2022-07-16 54396.0 152877.23
145 2022-02-05 56097.0 141844.16
146 2023-06-24 50273.0 155576.39

147 rows × 3 columns

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2021-01-23 70354.0 176293.12 2021 1 3
1 2022-01-15 58533.0 149056.54 2022 1 2
2 2022-01-22 56582.0 140421.99 2022 1 3
3 2021-04-10 74685.0 188378.25 2021 4 14
4 2021-06-19 82080.0 199918.61 2021 6 24
... ... ... ... ... ... ...
142 2021-09-25 65465.0 166427.52 2021 9 38
143 2023-07-01 55432.0 166191.44 2023 7 26
144 2022-07-16 54396.0 152877.23 2022 7 28
145 2022-02-05 56097.0 141844.16 2022 2 5
146 2023-06-24 50273.0 155576.39 2023 6 25

147 rows × 6 columns

Prophet Timeseries Modeling ¶

In [ ]:
import pandas as pd
from prophet import Prophet

# Convert 'DATE' to datetime and ensure it's the index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Prepare the dataframe for Prophet's convention
prophet_df = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Define holidays dataframe for Prophet, including Easter
# We add a window around Easter as additional regressors to capture the influence of the Easter period
easter_dates = pd.date_range(start='2015-04-05', end='2025-04-20', freq='A-APR')  # Rough Easter dates range
easter_df = pd.DataFrame({
    'holiday': 'easter',
    'ds': easter_dates,
    'lower_window': -14,  # 2 weeks before
    'upper_window': 14,   # 2 weeks after
})

# Initialize the Prophet model with holidays
m = Prophet(holidays=easter_df)

# Fit the Prophet model
m.fit(prophet_df)

# Create a future dataframe for predictions
# Extend into the future by the number of weeks you want to forecast
future = m.make_future_dataframe(periods=52*2, freq='W')

# Predict the future with the model
forecast = m.predict(future)

# Filter the predictions to the period around Easter 2024
mask = (forecast['ds'] >= '2024-03-17') & (forecast['ds'] <= '2024-04-28')  # 2 weeks before and after Easter
easter_forecast = forecast[mask]

# Plot the forecast
fig = m.plot(forecast)
plt.show()

# Print the forecasted values for the Easter period
print(easter_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/rotal_dm.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/h8w6598h.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=10876', 'data', 'file=/tmp/tmp7x5uyqfk/rotal_dm.json', 'init=/tmp/tmp7x5uyqfk/h8w6598h.json', 'output', 'file=/tmp/tmp7x5uyqfk/prophet_model7vmk9js7/prophet_model-20240401035159.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
03:51:59 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
03:51:59 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
            ds           yhat     yhat_lower     yhat_upper
167 2024-03-17  155430.232001  143092.959012  168266.443701
168 2024-03-24  152026.638557  139249.817349  165322.131528
169 2024-03-31  151425.512991  138135.801454  164182.091479
170 2024-04-07  158150.231133  145743.115804  170100.352276
171 2024-04-14  166621.533457  154411.097177  180128.666088
172 2024-04-21  167096.278309  155286.590345  179625.531063
173 2024-04-28  157828.651397  145500.091259  170951.679379
In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Calculate the split point
split_point = int(len(prophet_df) * 0.8)

# Split the data into training and test sets
train_df = prophet_df[:split_point]
test_df = prophet_df[split_point:]

# Initialize and fit the Prophet model on the training data
m = Prophet(holidays=easter_df)
m.fit(train_df)

# Create a dataframe for predictions that covers the test set period
future = m.make_future_dataframe(periods=len(test_df), freq='W')

# Predict on the future dataframe
forecast = m.predict(future)

# Filter out the predictions for the test set period
test_forecast = forecast[-len(test_df):]

# Calculate MAE and MSE using the test set
mae = mean_absolute_error(test_df['y'], test_forecast['yhat'])
mse = mean_squared_error(test_df['y'], test_forecast['yhat'])

print(f'MAE: {mae}')
print(f'MSE: {mse}')
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/_qccyyeg.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/avm4oxtl.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=23927', 'data', 'file=/tmp/tmp7x5uyqfk/_qccyyeg.json', 'init=/tmp/tmp7x5uyqfk/avm4oxtl.json', 'output', 'file=/tmp/tmp7x5uyqfk/prophet_modelwm3_cuzp/prophet_model-20240401035325.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
03:53:25 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
03:53:25 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
MAE: 15035.1263937731
MSE: 415153436.4735005

The MAE and MSE obtained from Prophet model is 15035 and 415153436 respectively.

Results ¶

The prophet model has calculated both upper and lower limits of the sales accurately. On March 17th, 2 weeks prior to Easter will have forecasted dollar sales around dollar 155430 and it dropped to dollar 152027 after 1 week. Following Easter, the dollar sales increased to dollar 158150 during April 1st week and then further continued to increase the trend with a value of dollar 166622.

7.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category¶

We now filter on caloric segment 'Regular', category 'SSD', flavor 'Avacado' with 'Non Swire-CC'

Data Preparation ¶

In [ ]:
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
In [ ]:
job = client.get_job('bquxjob_55910a8e_18e97d03276') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE CALORIC_SEGMENT = 'REGULAR'
AND MANUFACTURER != 'SWIRE-CC'
AND ITEM LIKE '%AVOCADO%'
AND CATEGORY = 'SSD'
GROUP BY DATE;
In [ ]:
job = client.get_job('bquxjob_55910a8e_18e97d03276') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES
0 2023-06-17 1488999.00 5393496.05
1 2021-01-02 1598226.00 4258603.90
2 2021-06-05 1919708.00 5320678.00
3 2021-11-20 1598380.00 4779108.36
4 2021-02-27 1582854.00 4337313.37
... ... ... ...
142 2023-08-26 1391590.00 5075602.61
143 2023-06-03 1523339.00 5480932.18
144 2023-03-11 1381573.00 5116483.31
145 2022-11-26 1586792.00 5427628.42
146 2023-10-28 1281609.65 4656563.73

147 rows × 3 columns

In [ ]:
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])

# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]

# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week

# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
Out[ ]:
DATE UNIT_SALES DOLLAR_SALES YEAR MONTH WEEK_OF_YEAR
0 2023-06-17 1488999.00 5393496.05 2023 6 24
1 2021-01-02 1598226.00 4258603.90 2021 1 53
2 2021-06-05 1919708.00 5320678.00 2021 6 22
3 2021-11-20 1598380.00 4779108.36 2021 11 46
4 2021-02-27 1582854.00 4337313.37 2021 2 8
... ... ... ... ... ... ...
142 2023-08-26 1391590.00 5075602.61 2023 8 34
143 2023-06-03 1523339.00 5480932.18 2023 6 22
144 2023-03-11 1381573.00 5116483.31 2023 3 10
145 2022-11-26 1586792.00 5427628.42 2022 11 47
146 2023-10-28 1281609.65 4656563.73 2023 10 43

147 rows × 6 columns

Prophet Timeseries Modeling ¶

In [ ]:
import pandas as pd
from prophet import Prophet

# Convert 'DATE' to datetime and ensure it's the index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)

# Prepare the dataframe for Prophet's convention
prophet_df = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Define holidays dataframe for Prophet, including Easter
# We add a window around Easter as additional regressors to capture the influence of the Easter period
easter_dates = pd.date_range(start='2015-04-05', end='2025-04-20', freq='A-APR')  # Rough Easter dates range
easter_df = pd.DataFrame({
    'holiday': 'easter',
    'ds': easter_dates,
    'lower_window': -14,  # 2 weeks before
    'upper_window': 14,   # 2 weeks after
})

# Initialize the Prophet model with holidays
m = Prophet(holidays=easter_df)

# Fit the Prophet model
m.fit(prophet_df)

# Create a future dataframe for predictions
# Extend into the future by the number of weeks you want to forecast
future = m.make_future_dataframe(periods=52*2, freq='W')

# Predict the future with the model
forecast = m.predict(future)

# Filter the predictions to the period around Easter 2024
mask = (forecast['ds'] >= '2024-03-17') & (forecast['ds'] <= '2024-04-28')  # 2 weeks before and after Easter
easter_forecast = forecast[mask]

# Plot the forecast
fig = m.plot(forecast)
plt.show()

# Print the forecasted values for the Easter period
print(easter_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/zvdqcg5s.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/h49r5bew.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=96299', 'data', 'file=/tmp/tmp7x5uyqfk/zvdqcg5s.json', 'init=/tmp/tmp7x5uyqfk/h49r5bew.json', 'output', 'file=/tmp/tmp7x5uyqfk/prophet_modelvy7xivrl/prophet_model-20240401040158.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:01:58 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:01:59 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
            ds          yhat    yhat_lower    yhat_upper
167 2024-03-17  4.682668e+06  4.429568e+06  4.939516e+06
168 2024-03-24  4.580196e+06  4.346854e+06  4.827261e+06
169 2024-03-31  4.609232e+06  4.373648e+06  4.862417e+06
170 2024-04-07  4.805525e+06  4.538350e+06  5.043012e+06
171 2024-04-14  5.003006e+06  4.749780e+06  5.255360e+06
172 2024-04-21  5.020446e+06  4.754876e+06  5.258223e+06
173 2024-04-28  4.875050e+06  4.617603e+06  5.140573e+06
In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Calculate the split point
split_point = int(len(prophet_df) * 0.8)

# Split the data into training and test sets
train_df = prophet_df[:split_point]
test_df = prophet_df[split_point:]

# Initialize and fit the Prophet model on the training data
m = Prophet(holidays=easter_df)
m.fit(train_df)

# Create a dataframe for predictions that covers the test set period
future = m.make_future_dataframe(periods=len(test_df), freq='W')

# Predict on the future dataframe
forecast = m.predict(future)

# Filter out the predictions for the test set period
test_forecast = forecast[-len(test_df):]

# Calculate MAE and MSE using the test set
mae = mean_absolute_error(test_df['y'], test_forecast['yhat'])
mse = mean_squared_error(test_df['y'], test_forecast['yhat'])

print(f'MAE: {mae}')
print(f'MSE: {mse}')
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/9_g2pknf.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmp7x5uyqfk/lbi5bkju.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=85973', 'data', 'file=/tmp/tmp7x5uyqfk/9_g2pknf.json', 'init=/tmp/tmp7x5uyqfk/lbi5bkju.json', 'output', 'file=/tmp/tmp7x5uyqfk/prophet_model0me1f6vj/prophet_model-20240401040210.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
04:02:10 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
04:02:10 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
MAE: 508154.048327975
MSE: 431836866470.70966

The MAE and MSE obtained from Prophet model is 508154 and 431836866470 respectively.

Results ¶

The prophet model has calculated both upper and lower limits of the sales accurately for the competitor brands which has 'Avacado' flavor. On March 17th, 2 weeks prior to Easter will have forecasted dollar sales around dollar 4.68 Million and it dropped to dollar 4.58 Million after 1 week. Following Easter, the dollar sales increased to dollar 4.61 Million during April 1st week and then further continued to increase the trend with a value of dollar 4.81 Million.

Conclusion ¶

The notebook's analysis leverages advanced forecasting modeling to offer predictive insights into sales trends, with a specific focus on the best of 13 weeks or 26 weeks and 1 year and the performance of innovative products. The collaborative approach to modeling and analysis, combined with strategic insights drawn from the data, underscores the potential for data-driven decision-making in optimizing product sales and positioning in the market.

Given the constraints of directly accessing the detailed content and questions within the notebook, this approach aims to construct a generalized conclusion that reflects the analytical depth and strategic focus of the study.

Addressing questions related to strategic recommendations or business insights, the notebook's models likely emphasizes the value of accurate sales forecasting in making informed business decisions. The detailed analysis around key periods, coupled with the predictive performance of the models, offers a foundation for strategic planning, inventory management, and promotional activities to maximize sales and revenue.

Group Contribution ¶

Sai Eshwar Tadepalli - Prepared Notebook, Table of Contents, Performed Modeling of Prophet, ARIMA, SARIMA and Exponential Smoothing for Innovative Product 1,2,3,6,7 Reviewed the entire code and annotations, Used Google Bigquery to bring the data into Colab, Used Google Cloud Storage to store the data, Used tableau for EDA.

Abhiram Mannam - Performance analysis of the models using MAE and MSE. Detailed writeup of the code and descriptions of models. Performing the analysis for Innovative product 5 along with their performance analysis. Analyzing the results of the models in the results section. Producing the total sum of the sales of the products in those best performing weeks. Filtering the datasets based upon the combinations in python.

Kushal Ram Tayi - Write up of the notebook. Performing the analysis for Innovative product 4 along with their performance analysis. Proofreading the entire notebook. Research on the best models to work for the forecasting series.